Hi! (And thanks to @andy for the great answers!)
The easiest way to get started with modifying the recipe code is to look at the built-in recipes and tweak them. This will show you how everything fits together. The source of the recipes is shipped with Prodigy, and you can find the location of your installation like this:
python -c "import prodigy; print(prodigy.__file__)"
In `recipes/ner.py`, you’ll then find the recipe source and examples of how the sorter is implemented.
Alternatively, here are some other ideas you could experiment with: It looks like the `\n` character really is the main problem here, and it’s something we’ve observed before. When you run `ner.teach`, Prodigy will look at all possible entity analyses for the text and then suggest the ones the model is most uncertain about. And for some reason, this seems to be the `\n`, at least in the beginning.
If it’s not that important for your final model to be able to deal with random newlines at runtime (for example, if you can pre-process the text before analysing it), you could just add a pre-processing step that removes newlines. Since they really throw off the model, it might be more efficient for now to just strip them out, rather than to try and teach the model about them in detail.
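For example, a minimal pre-processing helper could look like this (just a sketch – the function name and the exact regex are my own, and you may want to tweak how whitespace is collapsed for your data):

```python
import re

def strip_newlines(text):
    # Replace any run of newlines (plus surrounding spaces) with a
    # single space, then trim leading/trailing whitespace.
    return re.sub(r"\s*\n+\s*", " ", text).strip()

print(strip_newlines("Patient klagt über\n\nFieber und Husten.\n"))
```

You could apply a function like this to your raw texts before creating the JSONL you load into Prodigy, so the model never sees the stray newlines in the first place.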
Additionally, you could also try and start off with the German model and a “blank” entity recognizer (instead of the pre-trained one). This especially makes sense if you’re only interested in your custom entities, and not in any of the other ones that the model predicts by default. I’m not sure if it’ll make a big difference here, but the idea is that a blank model will have no “constraints” that your new entities will have to fit to.
For example, the built-in entity types were trained on tens of thousands of examples, likely more than you will collect for your `DISEASE` type. The German entity recognizer also tends to struggle more with identifying entities, since it can’t rely so much on capitalisation. In English, a capitalised token is a strong indicator for an entity – in German, it could just be any regular noun. So if the pre-trained entity recognizer is already super confident that “Fieber” is a `MISC` or an `ORG` or whatever, it’ll be much more difficult to teach it a new definition.
Here’s how you can export the German model with a blank entity recognizer:
import spacy
nlp = spacy.load('de_core_news_sm') # load base model
new_ner = nlp.create_pipe('ner') # create blank entity recognizer
nlp.replace_pipe('ner', new_ner) # replace old one with new component
# make sure weights of new blank component are initialized
# (this step will likely not be necessary in the future)
nlp.begin_training()
nlp.to_disk('/path/to/model')
Prodigy can also load models from a path (just like `spacy.load`), so when you run `ner.teach`, you can now replace the model name with the path to the saved-out new model:
prodigy ner.teach disease_ner /path/to/model ...
(Btw, good luck with your thesis! If I remember correctly, there were several other threads on the forum that discussed training biomedical entities with Prodigy, so maybe you’ll also find some inspiration there.)