I have a "problem" with NER labeling. I have a number of phrases for the same concept, something like
"US has decided"
"US administration has decided"
"White House has decided"
"Washington has decided" (capital names are often used to represent the country)
and a few more. If I treat each as a separate entity, the number of occurrences per entity becomes very low, and NER training becomes difficult and converges slowly. It is possible to do something like this in spaCy, but is there an equivalent in Prodigy? I could parse the whole text with regexes for these phrases and map them to the concept, but that's not very elegant.
Prodigy uses spaCy under the hood, so it should be possible to do almost anything with Prodigy that you can do with spaCy.
I'm not 100% sure I understand what you want to do, but have you considered trying out the pattern matcher? You could create a patterns.jsonl file with each of your examples as a line, all sharing the same label.
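A minimal sketch of generating such a file with plain Python. It assumes exact-string patterns (Prodigy also accepts token-based patterns), borrows the US_ADMIN label used later in this thread, and the helper name build_pattern_lines is hypothetical:

```python
import json

# Surface forms that should all map to the same concept.
# The phrase list and the US_ADMIN label are illustrative choices.
PHRASES = ["US", "US administration", "White House", "Washington"]
LABEL = "US_ADMIN"


def build_pattern_lines(phrases, label):
    """Return one JSON string per phrase, every one carrying the same label."""
    return [json.dumps({"label": label, "pattern": phrase}) for phrase in phrases]


lines = build_pattern_lines(PHRASES, LABEL)

if __name__ == "__main__":
    # JSONL means one JSON object per line.
    with open("patterns.jsonl", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Each line of the resulting file has the form {"label": "US_ADMIN", "pattern": "White House"}. As far as I know, string patterns are matched as exact phrases, so a short form like "US" should not fire on the lowercase pronoun "us", but it is worth checking on your own data.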
Then you can pass the patterns to Prodigy when you start annotating:
prodigy ner.manual [your_args] --patterns ./patterns.jsonl
If you use the pattern matcher, it ends up being decently elegant. Also, if the phrases are regular enough that you could pull them out of documents with regexes, I would definitely add them as patterns to cut down on your annotation time.
thanks for the tips
I do have a pattern matching file (pretty long one) as some patterns do not occur that often, and they should be part of the gold patterns (esp. names in political cases). I agree, it speeds things up
In addition to that, I have a scenario that (I think) cannot be covered by labels. Putin, Vladimir Putin, and Vladimir V. Putin all have the label "PERSON", but they are treated as different entities. I want all versions of "Putin" to be treated as a single entry in the dictionary:
Putin = Vladimir Putin = Vladimir V. Putin
There is something in spaCy that can do that (https://spacy.io/usage/rule-based-matching), and I wonder if there is a quick fix for this in Prodigy.
PS: In retrospect, I think I'll go with regex for the names. Even with 100k lines it is faster (it just requires a bit of patience).
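For reference, the regex route for names could look roughly like this. It is only a sketch: the ALIASES table and both function names are made up for illustration, and a real list would be much longer.

```python
import re

# Canonical name -> surface forms seen in the text (illustrative only).
ALIASES = {
    "Vladimir Putin": ["Vladimir V. Putin", "Vladimir Putin", "Putin"],
}


def compile_alias_regex(aliases):
    """Build one regex over all variants plus a variant -> canonical lookup."""
    lookup = {v: canon for canon, variants in aliases.items() for v in variants}
    # Longest variants first, so "Vladimir V. Putin" wins over the bare "Putin".
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b")
    return pattern, lookup


def find_entities(text, pattern, lookup):
    """Return (start, end, canonical_name) for every alias found in text."""
    return [(m.start(), m.end(), lookup[m.group(0)]) for m in pattern.finditer(text)]


pattern, lookup = compile_alias_regex(ALIASES)
spans = find_entities("Vladimir V. Putin met Putin's aides.", pattern, lookup)
```

Both mentions in the sample sentence come back with the canonical name "Vladimir Putin", and the character offsets can be written out as pre-annotated spans if needed.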
This really isn't a named entity recognition task, as the snippets of text aren't named entities. I'm not really sure what the best way to structure your problem is, but you might try using a text classifier to identify sentences, or perhaps use the dependency parse with rules based on the syntactic structure. You would start with the verb, decide, and work outward from there.
thanks for your comments.
In the end I solved the problem the simple way, by writing a pattern file (White House = Oval Office, label US_ADMIN). Since I have a limited domain (politics), sentences like "I painted the white house blue" are rare (I think).
I'm quite confident that it also works for names, because first names and roles are all compounded with the last name. It is problematic if there are two presidents with the same last name (George Bush), but that can be solved by writing smarter patterns.
Maybe QuaD until the glorious failure?