First of all just wanted to say, amazing job on Prodigy! I purchased a license a few months ago and it has saved me so much time.
I have a problem that I'm struggling to find an approach for. I am trying to identify quantities, measurements, and base ingredients in the ingredient strings of a recipe. This is a pretty simple NER task and spaCy does quite well with it. The problem is examples like this, where there are multiple quantities and ingredients with unrelated text in between (“and” in this case):
"1 red and 2 green bell pepper cut into 1/2" pieces"
How do I connect “red” and “green” to “bell pepper”, and likewise make sure each quantity ends up with the proper entity? The end result would be:
Ingredient: red bell pepper
Ingredient: green bell pepper
You can see they both share “bell pepper”, but it refers to two different things. I am still quite new to spaCy and machine learning in general, so I apologize if this is a pretty basic thing. Thanks in advance!
Glad to hear Prodigy is going well for you!
I think your question actually touches on a pretty common issue around using NER for different use-cases.
A particular feature of names is that they're mostly atomic. They do still have some internal structure (e.g. "University of Kansas"), but mostly you can treat them as a flat span. Importantly, they don't combine freely with the rest of the grammar. Even if you have two entities "University of Kansas" and "Constitution of Kansas", you can't generally say something like "University and Constitution of Kansas". Or at least, that wouldn't be the name of anything.
Sometimes names do behave a bit less atomically. For instance, sometimes you'll find mentions like Microsoft Xbox and Xbox 360, and it'll be unclear how to annotate them. But this type of mention is pretty rare.
The problem you're having is that "red bell pepper" is an ordinary noun phrase, not a name. And that means it doesn't behave atomically. The syntax allows the structure to be split apart and combined with other elements in many ways, such as "red and green bell peppers", or "bell peppers that are red", or "reddish green bell peppers", or "red peppers (ideally poblano, but bell will do too)".
It's up to you to decide how important these edge cases are, and how you want to deal with them. The dependency parser provides a tree representation of the sentence, which gives you a more accurate model of how the words are related. You could work with the tree shape instead of the word sequence to deal with these coordination constructions.
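To make the tree idea a bit more concrete, here's a rough sketch in plain Python. It doesn't call spaCy at all: the parse of "red and green bell peppers" is written out by hand, and the `Token` tuple, the `expand_ingredients` helper, and the exact labels (`amod`, `cc`, `conj`, `compound`) are all assumptions for illustration. A real parser may attach these words differently, so treat it as a starting point rather than a finished solution. With spaCy you'd read the same information from `token.head` and `token.dep_`.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    head: int  # index of the syntactic head (the root points at itself)
    dep: str   # dependency label


# Hand-written parse of "red and green bell peppers".
# This is an assumed analysis, not actual parser output.
TOKENS = [
    Token("red", 4, "amod"),       # modifies "peppers"
    Token("and", 0, "cc"),         # coordinator attached to "red"
    Token("green", 0, "conj"),     # coordinated with "red"
    Token("bell", 4, "compound"),  # part of the noun compound
    Token("peppers", 4, "ROOT"),
]


def children(tokens, head_index, dep):
    """Indices of tokens attached to head_index with the given label."""
    return [
        j for j, tok in enumerate(tokens)
        if tok.head == head_index and tok.dep == dep and j != head_index
    ]


def expand_ingredients(tokens):
    """Produce one phrase per coordinated modifier of the root noun."""
    phrases = []
    for i, tok in enumerate(tokens):
        if tok.dep != "ROOT":
            continue
        compounds = [tokens[j].text for j in children(tokens, i, "compound")]
        base = " ".join(compounds + [tok.text])
        # Collect each adjectival modifier plus anything coordinated with it.
        modifiers = []
        for j in children(tokens, i, "amod"):
            modifiers.append(j)
            modifiers.extend(children(tokens, j, "conj"))
        if not modifiers:
            phrases.append(base)
        for j in modifiers:
            phrases.append(f"{tokens[j].text} {base}")
    return phrases


print(expand_ingredients(TOKENS))
# ['red bell peppers', 'green bell peppers']
```

The point is that "green" reaches "bell peppers" by following the `conj` arc back to "red" and then the `amod` arc to the noun, which is something a flat span annotation can't express. Quantities such as "1" and "2" would need a similar rule of their own, and this sketch doesn't attempt that.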
Alternatively, you might decide that it's worth giving up on some of these cases. Sure, NER might not fit the phenomena you're trying to capture properly, but the difference might not be important. You might just have to accept that there's no way to annotate "red and green bell peppers" under the approximation you're making.