Add a whole bunch of entities via a vocabulary

Hi, I've been using Prodigy for a while now and I've had pretty good results already. There is, however, a problem. Say, for example, I have a list of entities (something like ABC, DEF, GHI, JKL, etc., all tagged LETTERS). My training dataset, while quite big, doesn't mention all of these entities because there are too many of them. As a result, the trained model recognizes ABC and DEF most of the time, but fails to recognize GHI and JKL since they never came up in the dataset I annotated. My question is: would it be possible to add an underlying vocabulary containing all of my terms and their respective labels? I already did something like that with a patterns file in the very first step with ner.manual, but I'd like my trained model to recognize these entities as well.

The only other option left is to generate a fake dataset with all the entities we have but I hope there is a smarter way.

Thanks

Hi, if you have a full pattern list for the entities you always want to label, you can add an entity_ruler to your final pipeline to annotate them directly. If you're using Prodigy v1.10, here are the corresponding spaCy v2 docs: https://v2.spacy.io/api/entityruler. The pattern format should be the same because Prodigy and the entity ruler both use spaCy's matchers underneath.
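As a rough illustration of what that could look like, here is a minimal sketch. It assumes the LETTERS terms from the question, uses a blank English pipeline in place of the trained model so it runs on its own, and the sample sentence is made up. The version check is only there because the way a component is added differs between spaCy v2 and v3.

```python
import spacy
from spacy.pipeline import EntityRuler

# Terms and label taken from the question; one phrase pattern per term.
terms = ["ABC", "DEF", "GHI", "JKL"]
patterns = [{"label": "LETTERS", "pattern": term} for term in terms]

# Assumption: a blank pipeline stands in for the trained model here.
# With a real model you would load it with spacy.load(...) instead.
nlp = spacy.blank("en")

if spacy.__version__.startswith("2."):
    # spaCy v2 (used by Prodigy v1.10): build the component, then add it.
    ruler = EntityRuler(nlp)
    nlp.add_pipe(ruler)
else:
    # spaCy v3: add the component by its registered name.
    ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns(patterns)

# Hypothetical example sentence.
doc = nlp("ABC and GHI were mentioned.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

With a trained pipeline that already contains an ner component, the same add_pipe call accepts before="ner" to place the ruler ahead of it. If the patterns already live in a JSONL file from ner.manual, the ruler's from_disk method should be able to read that file instead of building the list by hand.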

Neither entity_ruler nor ner overwrites existing entities by default. Typically, people run the entity ruler first so that all the known entities from the patterns are guaranteed to be annotated, and then run ner to fill in the rest. The order of the components can affect the results of the ner model a bit, and it's also possible to have the entity ruler overwrite entities if it runs second, so you'd have to try out the options and see what makes sense for your task.
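To make the overwrite behaviour concrete, here is a small sketch of the "ruler runs second" case. No trained model is involved: an entity with the made-up label OTHER is set by hand to stand in for an earlier ner prediction, and the rulers are called directly on the doc rather than added to a pipeline. The sentence and the OTHER label are assumptions for illustration only.

```python
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

nlp = spacy.blank("en")
patterns = [{"label": "LETTERS", "pattern": term} for term in ["ABC", "DEF", "GHI", "JKL"]]

def doc_with_prior_entity():
    # Simulate a doc where an earlier component already labelled "ABC".
    doc = nlp.make_doc("ABC and GHI were mentioned.")
    doc.ents = [Span(doc, 0, 1, label="OTHER")]
    return doc

# Default behaviour: existing entities are left alone.
keep_ruler = EntityRuler(nlp)
keep_ruler.add_patterns(patterns)
kept = [(e.text, e.label_) for e in keep_ruler(doc_with_prior_entity()).ents]

# overwrite_ents=True: pattern matches replace overlapping entities.
overwrite_ruler = EntityRuler(nlp, overwrite_ents=True)
overwrite_ruler.add_patterns(patterns)
overwritten = [(e.text, e.label_) for e in overwrite_ruler(doc_with_prior_entity()).ents]

print(kept)         # "ABC" keeps the earlier OTHER label, "GHI" is added
print(overwritten)  # both "ABC" and "GHI" end up as LETTERS
```

In both cases the ruler still fills in GHI, which the simulated earlier step missed; the only difference is which label wins where the two overlap.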

Thank you, it looks like this is exactly what I was looking for.