Similar entities with different context

How can I deal with entities that have a similar pattern but mean something different in context? I want to train an NER model that recognizes different properties. The properties often consist of a number followed by a unit. Some entities are very specific, so a regex works well for annotating them.

For others, such as percentages, it is not so easy.

An example:

The Dow Jones is up 3% today.

The elongation at yield is 2%.

Both are expressed in %, but only the second one should be extracted. Is it possible that, after a little manual training, the model recognizes the context?

Is it possible that, after a little manual training, the model recognizes the context?

It depends a bit on how you've defined your NER model but, theoretically at least, the spaCy CNN model also takes the surrounding tokens into account. A model could therefore pick up the pattern that if the tokens "up" or "down" are nearby, the percentage is not relevant to the use case.

Whether this theoretical argument will work in practice is hard to say upfront. You may need a lot of data for the model to pick up the right pattern, and there are also other ways of dealing with this. You could just try a regex combined with some custom logic, something like:

"if the percentage sign is detected and the up/down token isn't in the same sentence, tag it"

In my experience, a combination of rule-based and model-based can work very well too.


Hi, first of all, thank you for your helpful answer!

"if the percentage sign is detected and the up/down token isn't in the same sentence, tag it"

Does this refer to specific words that can occur in the context of percentages, for example (shares, indices, etc.) for percentages around the topic of the stock market? Can this be mapped via patterns? I ask because the patterns that can be passed to Prodigy only refer to tokens.

In my experience, a combination of rule-based and model-based can work very well too.

Can you maybe explain this a little more? I am new to the field and very interested in good suggestions.

There's a series on YouTube that I made a while ago that explains the rules+model pattern for spaCy quite nicely (link), but if you prefer a shorter version, you can also watch the PyData talk I gave on the same topic (link). The use case for these videos is detecting programming languages in text, so a few details will likely differ for your use case.

That said, I usually prefer to have a script or a Jupyter notebook that generates the relevant .jsonl file on disk. I can still use a spaCy pattern matcher from Jupyter, but I can also use all sorts of other tools. I can inspect the .jsonl file manually and create any subset that might be of interest. The patterns file in Prodigy is a great feature, but if you want to experiment a bit with combining models, you may enjoy working from Jupyter too.
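
As a rough illustration, such a script could look like the sketch below. The file name, label, and pattern are placeholders, and I'm assuming the usual `{"text": ..., "spans": [...]}` JSONL layout that Prodigy reads:

```python
import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Same placeholder pattern as before: a number followed by a percent sign.
matcher.add("PROPERTY", [[{"LIKE_NUM": True}, {"TEXT": "%"}]])

texts = [
    "The Dow Jones is up 3% today.",
    "The elongation at yield is 2%.",
]

# Write pre-annotated candidates to a .jsonl file that can be loaded for annotation.
with open("percent_candidates.jsonl", "w", encoding="utf-8") as f:
    for doc in nlp.pipe(texts):
        spans = [
            {
                "start": doc[start:end].start_char,
                "end": doc[start:end].end_char,
                "label": "PROPERTY",
            }
            for _, start, end in matcher(doc)
        ]
        if spans:  # only keep texts that contain at least one candidate
            f.write(json.dumps({"text": doc.text, "spans": spans}) + "\n")
```

Because this runs outside Prodigy, you can add any custom logic you like (such as the up/down sentence check above) before the examples ever reach the annotation interface.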
