Hi! I have a few questions that are more about spaCy than Prodigy. I apologize if they are off-topic here, but I haven't found anywhere else to ask (I posted a question on Stack Overflow, but it remains unanswered, so I'm reposting it here almost unchanged).
I need to annotate a corpus of international relations/policy articles, for which I use spaCy (with Prodigy atop). The default English models come with a set of pre-defined entity types, most of which are, in theory, easily applicable to my purposes. However, the only piece of documentation I found is just a table with very short descriptions, which does not answer the (quite numerous) questions I faced while working on the annotation.
So my primary question is: are there more detailed guides or documentation for these entity types (e.g. NORP, GPE and so on), or at least an extensive set of examples? I fear that I might have been searching in the wrong places all this time.
I also suspect that some might consider generalized guides rather dull, because they would not address many individual cases, but I think that having them would make the basics a lot easier for newcomers (like me).
And in case such documentation is nowhere to be found, I would appreciate it if someone could help with at least the most important questions (I consider them too small to open a separate topic for each, but I might be wrong):
- In cases where the name of something is followed by an abbreviation (see the example below), should it be considered one entity or two separate entities? What should determine my choice?
Non-proliferation treaty (NPT), which contains the only binding commitment to nuclear disarmament in a multilateral treaty (...)
- Similarly, when one phrase implies two entities, but they are not fully separated syntactically, how can I capture both entities correctly? Take the following example, which refers to two separate events:
concluding documents of the Madrid and Vienna conferences
- There are certain cases of ambiguity, e.g. 'Kyoto' may refer to the protocol just as well as to the city:
(...) undermines most points of the Kyoto.
- Finally, there's a question I think I found an answer to. I asked the following: "Is NORP only meant for tagging names of national/ethnic groups, or is it also used when an adjective indicates that another entity belongs to some nation/political/religious group? So in 'Iranian nuclear program', it is ok to tag 'Iranian' as NORP?" So, judging by this example, my assumption was true.
These are all very good questions. I was bothered a lot by this type of issue during a research project, and we had a student who was working on a better named entity scheme with nested annotations. Unfortunately the ML community does often go with a "worse is better" approach: the flat entity scheme is good enough for most situations, and it makes a number of things simpler. So we accept that some entity mentions have no convincing answer, in order to work with flat, non-overlapping spans as they're easier to program with.
Since the annotation scheme has these edge-case problems, it's difficult to reason about what the right answer should be in some situations. What we want to do is match the policies used in the training corpus. The OntoNotes 5 manual is helpful for this, but the ultimate answer comes down to what the current model has learned from the annotations.
I therefore usually try to probe the model's current policies by trying to find especially simple cases, and showing them to the model. For instance, the model is sure to get the following case "right", according to its annotation scheme: https://explosion.ai/demos/displacy-ent?text=International%20Business%20Machines%20(IBM)&model=en_core_web_lg&ents=person%2Corg%2Cgpe%2Cloc%2Cproduct%2Cnorp%2Cdate%2Cper%2Cmisc . Another example: https://explosion.ai/demos/displacy-ent?text=United%20Nations%20(UN).
This indicates that you should put the acronyms into a separate entity. This matches with my own intuition -- I think the scheme is better like that. The NER annotations will be easier to learn if you have fewer unique phrases that can be mentions of the entity.
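If you already have annotations where the long form and the acronym were marked as one span, something like the following could split them. This is only a minimal sketch in plain Python, not a spaCy or Prodigy API: `split_acronym` is a hypothetical helper, the spans are assumed to be dicts with character-offset "start" and "end" keys plus a "label", and the LAW label is just an illustrative choice.

```python
import re

# Matches a parenthesised acronym such as " (NPT)" at the very end of a span.
TRAILING_ACRONYM = re.compile(r"\s*\(([A-Z][A-Za-z&.]*)\)$")


def split_acronym(text, span):
    """Split a span covering "Long Name (ABBR)" into two separate spans.

    `span` is a dict with "start"/"end" character offsets and a "label".
    Returns the span unchanged (in a list) if no trailing acronym is found.
    """
    surface = text[span["start"]:span["end"]]
    match = TRAILING_ACRONYM.search(surface)
    if match is None:
        return [span]
    long_form = {
        "start": span["start"],
        "end": span["start"] + match.start(),  # stop before " (ABBR)"
        "label": span["label"],
    }
    acronym = {
        "start": span["start"] + match.start(1),  # inside the parentheses
        "end": span["start"] + match.end(1),
        "label": span["label"],
    }
    return [long_form, acronym]


text = "The Non-proliferation treaty (NPT) contains a binding commitment."
phrase = "Non-proliferation treaty (NPT)"
start = text.index(phrase)
whole = {"start": start, "end": start + len(phrase), "label": "LAW"}
parts = split_acronym(text, whole)
part_texts = [text[p["start"]:p["end"]] for p in parts]
```

Here `part_texts` holds the two surface strings, so you can check by eye that the parentheses themselves stay outside both entities.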
For point 2, you're kind of stuck. There's no good answer to this. Try to avoid entities that have too much syntactic interaction, and prefer entities that work as syntactic units. If this type of entity is common in your data, you might try having "conferences" as the target word and a fix-up rule to expand the boundaries based on the dependency parse. Otherwise, you could label the whole span as an entity, which is sure to confuse your model. I would try to exclude this from your training data.
For point 3, this is a problem often referred to as "metonymy". It makes detailed entity linking very difficult, and you'll never resolve it properly. If possible, try to design your application so that it doesn't need to resolve this.
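To make the fix-up rule idea a bit more concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the `Token` class, the helper names, and especially the parse, which is written by hand using spaCy-style dependency labels rather than produced by a real parser. The rule takes the one-token entity "conferences" and expands it leftwards over the tokens that depend on it.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str   # dependency label
    head: int  # index of the head token (the root points at itself)


# Hand-written toy parse (an assumption, not real parser output).
TOKENS = [
    Token("concluding", "amod", 1),
    Token("documents", "ROOT", 1),
    Token("of", "prep", 1),
    Token("the", "det", 7),
    Token("Madrid", "compound", 7),
    Token("and", "cc", 4),
    Token("Vienna", "conj", 4),
    Token("conferences", "pobj", 2),
]


def subtree(tokens, index):
    """Return sorted indices of `index` and every token that depends on it."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in found and tok.head in found:
                found.add(i)
                changed = True
    return sorted(found)


def expand_entity(tokens, target):
    """Expand a one-token entity at `target` leftwards over its modifiers.

    Returns (start, end) token offsets with an exclusive end. Assumes the
    left-hand part of the subtree is contiguous, which holds for this toy parse.
    """
    indices = [i for i in subtree(tokens, target) if i <= target]
    # Drop leading determiners so "the" stays outside the entity.
    while len(indices) > 1 and tokens[indices[0]].dep == "det":
        indices.pop(0)
    return indices[0], target + 1


def named_modifiers(tokens, target):
    """List the compound modifiers of `target` together with their conjuncts."""
    direct = [i for i, t in enumerate(tokens)
              if t.head == target and t.dep == "compound" and i != target]
    conjuncts = [i for i, t in enumerate(tokens)
                 if t.dep == "conj" and t.head in direct]
    return [tokens[i].text for i in sorted(direct + conjuncts)]


start, end = expand_entity(TOKENS, 7)
expanded = " ".join(t.text for t in TOKENS[start:end])
modifiers = named_modifiers(TOKENS, 7)
```

The model still only predicts one flat span; the two separate events would be recovered afterwards from `modifiers`, outside the NER model itself.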
For point 4, yes, "Iranian" there will be a NORP. This is an English-specific quirk of the NER scheme. The adjectival forms are capitalised in English, and so we compromise and call them "entities", even though they really aren't.
Thank you very much for the answers!
As for point 2, I will most probably leave that sentence (and similar ones) out. But just to make it clear: if I wanted to use the fix-up rule solution, would I first have to manually annotate POS tags and dependencies for enough similar examples, and then train a model? Would that help in recognizing two overlapping entities as separate ones? I'm quite new to NLP, so while I think I understand the overall idea, the sequence of steps is rather blurry to me.
And btw, may I answer my question on Stack Overflow (and add a link to your post)?
Sure, that's a good idea! Really appreciate that you asked on SO first btw (and sorry we don't always get around to answering all questions there)
Hi there! I would appreciate it if you could help me out with some questions concerning spaCy. We are about to run an NER project and can't decide how to use the DATE label (one of the standard labels provided by spaCy). The description says that the label stands for "absolute or relative dates or periods". While experimenting with our corpora, we encountered two issues that we are now trying to resolve.
For example, in the phrase "from 16 to 18 of November 1995", do we label only "November 1995" as DATE, because otherwise it may be confusing?
In the phrase "in the next three years", do we label only "next three years", "three years", or the whole phrase?
Phrases like "from 16 to 18 of November 1995" are challenging for sequence-tagging NER models: they basically break the simplifying assumption that entities must be contiguous flat spans of text. Because they break a fundamental assumption, there's not really going to be a good answer for them.
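A small sketch of why the flat-span assumption bites here. BIO tagging is used as one common way of encoding spans for a sequence tagger (an assumption for illustration, not a claim about how any particular model encodes them), and `spans_to_bio` is a hypothetical helper. Each token gets exactly one tag, so one flat DATE span is easy to encode, while two dates that share "of November 1995" cannot be encoded at all.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags.

    Raises ValueError when two spans overlap, because every token can
    carry only one tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in sorted(spans):
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span {(start, end, label)} overlaps another span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = "from 16 to 18 of November 1995".split()

# One flat span over the whole expression can be encoded.
flat_tags = spans_to_bio(tokens, [(1, 7, "DATE")])

# Two dates sharing "of November 1995" would need overlapping spans.
try:
    spans_to_bio(tokens, [(1, 7, "DATE"), (3, 7, "DATE")])
    overlap_error = None
except ValueError as err:
    overlap_error = str(err)
```

Whatever policy you pick, it has to be one that fits into a single tag per token.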
I would probably try to craft policies that follow the behaviour of the current models around these things, because you want to do whatever the original annotators of the training data did (as there's no really consistent answer). You can run the model on text to find out --- the online demo can also be helpful: https://explosion.ai/demos/displacy-ent?text=Prices%20from%2016%20to%2018%20of%20November%201995%20stayed%20stable%2C%20but%20dropped%20precipitously%20afterwards.&model=en_core_web_sm&ents=person%2Corg%2Cgpe%2Cloc%2Cproduct%2Cnorp%2Cdate%2Cper%2Cmisc%2Cwork_of_art%2Clanguage%2Cevent%2Ctime%2Cordinal%2Cquantity%2Cmoney%2Cpercent%2Ccardinal
Another way would be to look at the annotation manual of the OntoNotes corpus. I can't find detailed discussion of the NER policies though, so it might be best to look at how examples are handled by the models.
I would say that the annotation decision in the example I linked above is questionable, but probably not the worst decision. It at least lets you recover the information, especially by cross-referencing with the dependency parse.
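As a rough illustration of recovering the information afterwards: the sketch below uses a regular expression rather than the dependency parse, which is a simpler stand-in and not the approach described above. `parse_day_range` and the pattern are hypothetical and only cover the "from D to D of Month YYYY" shape from the question.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Covers only shapes like "from 16 to 18 of November 1995".
DAY_RANGE = re.compile(
    r"(?:from\s+)?(\d{1,2})\s+to\s+(\d{1,2})\s+(?:of\s+)?([A-Z][a-z]+)\s+(\d{4})"
)


def parse_day_range(text):
    """Return (first_day, last_day) as dates, or None if nothing matches.

    Invalid day numbers (e.g. 31 for February) will raise ValueError.
    """
    match = DAY_RANGE.search(text)
    if match is None or match.group(3) not in MONTHS:
        return None
    first, last, month, year = match.groups()
    month_number = MONTHS[month]
    return (date(int(year), month_number, int(first)),
            date(int(year), month_number, int(last)))


day_range = parse_day_range("Prices from 16 to 18 of November 1995 stayed stable.")
no_range = parse_day_range("in the next three years")
```

This kind of rule only makes sense as post-processing on text the model has already marked as a DATE; it is not a replacement for the annotation policy itself.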