I have a question about multi-term entities. I want to create an NER model that identifies products in customer complaints, both specific product names (e.g. waze, spotify, iphone 6S 32Gb, Ford Taurus, Barbie Doll, Smart TV LG 43 43uiS40) and more general products such as smart tv or doll when they appear alone, without further specification. The problem is that when I use ner.teach and SMART TV LG 43 appears in the text, Prodigy tends to highlight only smart tv. If I accept, it reinforces that behaviour: it will always show only smart tv and never highlight the full product SMART TV LG 43. Am I doing something wrong?
Hi! If you want to use binary annotation with a model in the loop, you’re always giving feedback on the suggestion in this exact context. So you should definitely reject incomplete spans. This way, you’re telling the model “no, this particular analysis of the text is incorrect”, the weights will be updated to reflect that particular decision and the model will “try again” with a different analysis, hopefully moving towards more correct entity boundaries.
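To make the accept/reject decision concrete, here is a minimal sketch of what the two answers could look like as annotation records. It assumes the usual task layout with "text", "spans" and "answer" keys; the PRODUCT label, the example sentence and the make_task helper are made up for illustration.

```python
# Hypothetical example sentence based on the product from the question.
text = "My SMART TV LG 43 stopped working after a week."


def make_task(text, phrase, label, answer):
    """Build one binary annotation record for a suggested span.

    make_task is an illustrative helper, not part of Prodigy.
    """
    start = text.index(phrase)  # character offset of the suggested span
    return {
        "text": text,
        "spans": [{"start": start, "end": start + len(phrase), "label": label}],
        "answer": answer,
    }


# The model suggests only "SMART TV": the span is incomplete, so reject it.
partial = make_task(text, "SMART TV", "PRODUCT", "reject")

# A later suggestion covers the full product name: accept it.
full = make_task(text, "SMART TV LG 43", "PRODUCT", "accept")

for task in (partial, full):
    span = task["spans"][0]
    print(task["answer"], "->", text[span["start"]:span["end"]])
```

The generic "smart tv" would still be accepted in a sentence where it really does appear on its own, because the feedback always applies to the suggestion in that particular context.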
That said, if your data contains a lot of fairly abstract multi-token entities like that and the model struggles, it might take pretty long until it converges (or it might not converge at all). You could try adding some --patterns, or collect a small set with ner.make-gold that covers the especially complex entities, pre-train the model with that and then improve that pre-trained model further with ner.teach. You might also want to check out this thread, which discusses an approach for extending entity boundaries with rules: Expanding NER to include neighbouring tokens
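As a rough sketch of the patterns idea, the script below writes a small JSONL file with one token-based match pattern per line. The PRODUCT label, the file name product_patterns.jsonl, the phrase list and the phrase_to_pattern helper are all assumptions for illustration. Splitting on whitespace is a simplification and may not line up with the tokenizer for strings like "32Gb", so it is worth checking the patterns against your real tokenization.

```python
import json


def phrase_to_pattern(phrase, label="PRODUCT"):
    """Turn a phrase into a case-insensitive token pattern.

    Illustrative helper: assumes one whitespace-separated word per token.
    """
    return {
        "label": label,
        "pattern": [{"lower": word.lower()} for word in phrase.split()],
    }


# Both generic and specific product mentions, taken from the question.
phrases = ["smart tv", "Smart TV LG 43", "Barbie Doll", "Ford Taurus"]
patterns = [phrase_to_pattern(p) for p in phrases]

# One JSON object per line (JSONL).
with open("product_patterns.jsonl", "w", encoding="utf-8") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")

print(json.dumps(patterns[0]))
```

The resulting file could then be passed via the --patterns option when running ner.teach, so that the full multi-token product names are suggested alongside the model's own predictions.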