Relax matching criteria in NER scoring?


Are there any references for ways to relax the matching criteria when scoring NER models during training in Prodigy? E.g. one of our entity types is EQUIPMENT, one of the gold labels is "Truck Unit 1714", and if the model predicted "Truck" or "Unit 1714", that would count as an acceptable match rather than requiring the exact "Truck Unit 1714" span.



I'm not sure this is something you'd want to do in general. Typically, entities as extracted by an NER model do have strict boundaries. In your example, "Truck Unit 1714" seems like a clear Named Entity to me, but "Truck" by itself just doesn't carry the same kind of semantic information.

During annotation and prediction, ideally you'd apply a consistent set of guidelines to make sure the entities are always annotated the same way. This consistency is what helps the model make confident predictions. If you instead let the model off the hook for "fuzzy" boundaries (by not backpropagating those errors), it would have a harder time learning what is actually correct.
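To make the distinction concrete: strict NER scoring counts a prediction as correct only when the span boundaries and the label both match exactly, while a "relaxed" scorer might credit any overlap with the same label. A minimal sketch in plain Python (this is an illustration, not a Prodigy or spaCy API; spans are hypothetical `(start, end, label)` tuples with exclusive ends):

```python
def strict_match(pred, gold):
    # Exact match: boundaries and label must all agree.
    return pred == gold

def relaxed_match(pred, gold):
    # Relaxed match: same label and any character overlap counts.
    p_start, p_end, p_label = pred
    g_start, g_end, g_label = gold
    return p_label == g_label and p_start < g_end and g_start < p_end

gold = (0, 15, "EQUIPMENT")   # "Truck Unit 1714"
pred = (6, 15, "EQUIPMENT")   # "Unit 1714"

print(strict_match(pred, gold))   # False: boundaries differ
print(relaxed_match(pred, gold))  # True: spans overlap, labels agree
```

The point of the answer above is that if the relaxed predicate is what drives the training signal, both `(0, 15)` and `(6, 15)` look equally "correct" to the model, so it never learns which boundary you actually want.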

Think about this from a human perspective. Let's say I'm a human annotator, and one day you tell me that "Truck" is just fine as an entity, while the next day you're very happy with "Truck Unit 1714". Now when I see "Truck Unit 89" in a sentence, I'm left confused as to whether I should give you "Truck" or "Truck Unit 89" as the entity.

In short: I think doing this would make the NER task more difficult, not easier.