concepts representation

aph61 · October 5, 2020, 9:49pm

Hi,

I have a "problem" with NER labeling. I have a number of phrases for the same concept, something like

"US has decided"
"US administration has decided"
"White House has decided"
"Washington has decided" (capital names are often used to stand for the country)

and a few more. If I treat each as a separate entity, the number of occurrences per entity becomes very low, and NER training becomes difficult and converges slowly. It is possible to do something like this in spaCy, but is there something equivalent in Prodigy? I can parse the whole text with regex on these phrases and map them to the concept, but that's not very elegant.

Andreas

justindujardin · October 6, 2020, 3:41pm

Hi Andreas,

Prodigy uses spaCy under the covers, so it should be possible to do almost anything with Prodigy that you can do with spaCy.

I'm not 100% sure I understand what you want to do, but have you considered trying out the Pattern Matcher? You could create a patterns.jsonl file with each of your examples as a line with the same label, e.g.

{"label":"MY_LABEL","pattern":"Cool stuff"}
{"label":"MY_LABEL","pattern":"Neat things"}
{"label":"MY_LABEL","pattern":"Amazing items"}

Then you can pass the patterns to Prodigy when you start annotating:

prodigy ner.manual [your_args] --patterns ./patterns.jsonl

If you use the pattern matcher, it ends up being decently elegant. Also, if your matches work well enough to pull out of documents with regex, I would definitely add those patterns to cut down on your annotation time.
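A minimal sketch of generating such a patterns file from a table of concepts, so that every variant of a concept shares one label. The `CONCEPTS` table, the helper names, and the `US_ADMIN` label are illustrative assumptions, not part of Prodigy; only the `{"label": ..., "pattern": ...}` line format comes from the example above.

```python
import json

# Hypothetical concept table: one label per concept, many surface phrases.
CONCEPTS = {
    "US_ADMIN": ["US", "US administration", "White House", "Washington"],
}


def build_patterns(concepts):
    """Return one pattern dict per phrase; all variants share the concept label."""
    patterns = []
    for label, phrases in concepts.items():
        for phrase in phrases:
            patterns.append({"label": label, "pattern": phrase})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns) + "\n"


# Print the JSONL so it can be redirected into a file.
print(to_jsonl(build_patterns(CONCEPTS)), end="")
```

Saving the printed output as patterns.jsonl gives a file that can be passed with --patterns as in the command above.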

aph61 · October 7, 2020, 2:45pm

thanks for the tips

I do have a pattern file (a pretty long one), since some patterns do not occur that often and they should be part of the gold patterns (especially names in political cases). I agree, it speeds things up.

In addition to that, I have a scenario that cannot be covered by labels (I think). Putin, Vladimir Putin, and Vladimir V. Putin all have the label "PERSON", but they are treated as different entities. I want all versions of "Putin" to be treated as a single entry in the dictionary:

Putin = Vladimir Putin = Vladimir V. Putin

There is something in spaCy that can do that (https://spacy.io/usage/rule-based-matching), and I wonder if there is a quick fix for this in Prodigy.

PS: In retrospect, I think I'll go for regex for the names. Even with 100k lines it is faster (it just requires a bit of patience).
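The regex route could look roughly like this: a sketch that maps every surface form to one canonical key, assuming a hand-written alias table. The `ALIASES` table, the key names, and both helper functions are hypothetical. Longer variants are tried first so that "Vladimir V. Putin" is matched as a whole rather than as a bare "Putin".

```python
import re

# Hypothetical alias table: canonical key -> surface forms found in the text.
ALIASES = {
    "PUTIN": ["Vladimir V. Putin", "Vladimir Putin", "Putin"],
    "US_ADMIN": ["US administration", "White House", "Washington"],
}


def compile_alias_regex(aliases):
    """Build one regex over all surface forms plus a surface -> key lookup."""
    surface_to_key = {s: key for key, forms in aliases.items() for s in forms}
    # Longest forms first, so the alternation prefers the most specific variant.
    ordered = sorted(surface_to_key, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b")
    return pattern, surface_to_key


def find_concepts(text, pattern, surface_to_key):
    """Return (start, end, canonical_key) for every match in the text."""
    return [
        (m.start(), m.end(), surface_to_key[m.group(0)])
        for m in pattern.finditer(text)
    ]


pattern, lookup = compile_alias_regex(ALIASES)
text = "Vladimir V. Putin met officials; later Putin said the White House had called."
for start, end, key in find_concepts(text, pattern, lookup):
    print(key, repr(text[start:end]))
```

Matching here is case-sensitive on purpose, so a lowercase "white house" is not mapped to the concept.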

Andreas

honnibal · October 7, 2020, 7:50pm

Hi Andreas,

This really isn't a named entity recognition task, as the snippets of text aren't named entities. I'm not really sure what the best way to structure your problem will be, but you might try using a text classifier to identify sentences, or perhaps you could use the dependency parse with rules based on the syntactic structure? You would start with the verb, decide, and go out from there.

aph61 · October 11, 2020, 2:46pm

Hi Matthew,

thanks for your comments.

In the end I solved the problem the simple way by writing a pattern file. (White House = Oval Office, label US_ADMIN). Since I have a limited domain (politics) sentences like "I painted the white house blue" are rare (I think).

I'm quite confident that it also works for names, because first names and roles are always attached to the last name. It is problematic if there are two presidents with the same name (George Bush), but that can be solved by making smart patterns.
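One way to spot the George Bush problem before it bites is to check the alias table for surface forms claimed by more than one concept. This is only a sketch; the table, the keys `BUSH_41` and `BUSH_43`, and the function name are made up for illustration.

```python
# Hypothetical alias table; keys and names are illustrative only.
ALIASES = {
    "BUSH_41": ["George H. W. Bush", "George Bush"],
    "BUSH_43": ["George W. Bush", "George Bush"],
    "PUTIN": ["Vladimir Putin", "Putin"],
}


def ambiguous_surface_forms(aliases):
    """Return surface forms that belong to more than one canonical key."""
    owners = {}
    for key, forms in aliases.items():
        for form in forms:
            owners.setdefault(form, set()).add(key)
    return {form: sorted(keys) for form, keys in owners.items() if len(keys) > 1}


print(ambiguous_surface_forms(ALIASES))
```

Any form reported here needs a more specific pattern (a middle initial, a role, a date range) or has to be resolved by hand.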

Maybe quick and dirty (QuaD) until the glorious failure?

Andreas
