Domain Specific Dictionary Files

Hi, I am new to Prodigy and want to build an NER model for Italian-language medical data. I am using Flair embeddings and I have several domain-specific dictionary files (person names, city names, etc.). I am confused about how to add this dictionary data to NER models. Any suggestions would be very helpful.

Thanks!

Hi! I think the most straightforward option would be to stream in your raw text, match all the entries in your dictionary in the text if they occur, and then correct/update them manually to create your final training data. So every time a person name from your dictionary is found, you add a span for PERSON to the example. Here's an example that shows how you can stream in predictions from a custom model – but instead of a custom model, this could also just be your dictionary lookup: https://prodi.gy/docs/named-entity-recognition#custom-model
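To make the dictionary-lookup idea concrete, here is a minimal sketch in plain Python. It assumes the dictionaries are already loaded into memory as label-to-terms lists, and it yields examples with a "text" and a list of "spans" with character offsets and a label. The function names, the sample terms, and the case-insensitive whole-word matching are my own assumptions for illustration, not anything Prodigy prescribes.

```python
import re


def build_matchers(dictionaries):
    """Compile one case-insensitive regex per label from its list of terms."""
    matchers = []
    for label, terms in dictionaries.items():
        # Longest terms first, so a longer name wins over a shorter prefix of it.
        ordered = sorted({t for t in terms if t.strip()}, key=len, reverse=True)
        if not ordered:
            continue
        alternation = "|".join(re.escape(term) for term in ordered)
        regex = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
        matchers.append((label, regex))
    return matchers


def add_dictionary_spans(texts, dictionaries):
    """Yield one example per text, pre-labelled with the dictionary matches."""
    matchers = build_matchers(dictionaries)
    for text in texts:
        spans = []
        for label, regex in matchers:
            for match in regex.finditer(text):
                spans.append(
                    {"start": match.start(), "end": match.end(), "label": label}
                )
        spans.sort(key=lambda span: span["start"])
        yield {"text": text, "spans": spans}


# Hypothetical sample data; in practice these would come from your dictionary files.
dictionaries = {"PERSON": ["Mario Rossi"], "CITY": ["Milano", "Roma"]}
texts = ["Il paziente Mario Rossi è stato ricoverato a Milano."]
tasks = list(add_dictionary_spans(texts, dictionaries))
```

A stream like this only suggests spans. The matches still need to be reviewed and corrected during annotation before they become training data.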

Prodigy's built-in recipes let you use spaCy's Matcher to pre-label examples based on token-based rules and dictionaries: https://prodi.gy/docs/named-entity-recognition#manual-patterns. This can be a bit more flexible than plain dictionary lookups, because you'll be able to describe tokens and their attributes and do stuff like "any number plus case-insensitive 'january', 'february', ...".
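As a rough sketch of how a dictionary could be turned into token-based patterns, the snippet below converts each term into one pattern entry with a "label" and a list of token descriptions that match on the lowercase form. It assumes that splitting on whitespace is close enough to how the tokenizer splits your terms, which won't hold for entries containing punctuation or apostrophes. The helper names and the sample terms are made up for illustration.

```python
import json


def terms_to_patterns(label, terms):
    """Turn each dictionary term into a case-insensitive token pattern."""
    patterns = []
    for term in terms:
        tokens = term.split()  # assumption: whitespace split approximates the tokenizer
        if not tokens:
            continue
        patterns.append(
            {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}
        )
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)


# Hypothetical sample terms; in practice, read them from your dictionary files.
patterns = terms_to_patterns("PERSON", ["Mario Rossi"])
patterns += terms_to_patterns("CITY", ["Milano"])
jsonl = to_jsonl(patterns)
```

The resulting lines can be saved to a file and passed to a recipe that accepts patterns, as described in the docs linked above.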

It's typically a good idea to view your dictionary matches during annotation and correct them, so you can get a feeling for what's missing and correct any mistakes. Those are often the cases that are especially interesting: misspellings, new entities that aren't in your dictionary yet and, of course, ambiguous entities where the context matters ("apple" vs. "Apple"), which is where NER makes the most difference. If there are spans that you know will always be a given entity, you can always add rules on top to boost your accuracy (for example, "Apple, Inc." will always be an ORG): https://spacy.io/usage/rule-based-matching#entityruler
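The "rules on top" idea can be illustrated without any library. The sketch below is not the EntityRuler itself, just a plain-Python stand-in for the same principle: spans from a fixed rule list replace any overlapping span predicted by the model. The function name, the rule format, and the sample predictions are assumptions made up for this example.

```python
def apply_rules(text, predicted_spans, rules):
    """Let exact-phrase rules override overlapping model predictions."""
    rule_spans = []
    for phrase, label in rules.items():
        if not phrase:
            continue
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            rule_spans.append({"start": start, "end": end, "label": label})
            start = text.find(phrase, end)

    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    # Keep only the predictions that no rule span touches.
    kept = [
        span
        for span in predicted_spans
        if not any(overlaps(span, rule) for rule in rule_spans)
    ]
    return sorted(kept + rule_spans, key=lambda span: span["start"])


text = "Apple, Inc. opened an office in Milano."
city_start = text.index("Milano")
# Hypothetical model output: "Apple" is mislabelled, "Milano" is correct.
predicted = [
    {"start": 0, "end": 5, "label": "PRODUCT"},
    {"start": city_start, "end": city_start + len("Milano"), "label": "CITY"},
]
final_spans = apply_rules(text, predicted, {"Apple, Inc.": "ORG"})
```

Here the rule for "Apple, Inc." replaces the overlapping PRODUCT prediction, while the CITY span is left untouched.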