I can’t use the NER provided with spaCy’s out-of-the-box models because my underlying data is too different from those models’ training data. Entities like LAW, EVENT, and WORK_OF_ART, for example, are almost nonexistent in my data. But to use ner.batch-train I still need to supply it with a model. Is the problem that I can only use ner.batch-train to improve / update weights in an existing model, and I need to create the initial model outside of Prodigy?
Hi! The model you pass into recipes like ner.batch-train doesn’t have to be pre-trained. It just needs to be a base model that includes the language data to use, the tokenization rules, etc. This also makes it easier to start off with custom tokenization rules (which can often be very useful for specific domains), or with any other language supported by spaCy that we don’t ship pre-trained models for yet.
To create a blank base model, you can do the following:
import spacy
nlp = spacy.blank("en") # or any other language
nlp.to_disk("/path/to/model")
Or, as a handy one-liner:
python -c "import spacy; spacy.blank('en').to_disk('/path/to/model')"
You can then load /path/to/model in as the base model when you call ner.batch-train.
The latest version of Prodigy also supports a shortcut for this and lets you pass in blank:en to start off with a blank model of a given language.
Thanks! blank:en seems to work nicely. With a blank model though, is there a way to still have it do dependency parsing and POS tagging? It looks like all it can do is tokenization.
Try:
import spacy
nlp = spacy.load("en_core_web_lg")
nlp.remove_pipe("ner")  # drop the NER component so it isn't saved with the model
nlp.to_disk("./en_dep_web_lg")
This should save you out a model directory that has the original components, sans the ner component. You can then use that model directory with Prodigy.
You can also manipulate the model directories directly, if that ends up being more convenient. Just remove the directories you don’t want, and update the meta.json file.
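If you go the directory route, here is a minimal sketch of what that edit could look like. It assumes the spaCy v2 layout, where each component lives in a subdirectory named after it and meta.json lists the components under a "pipeline" key; remove_component is a hypothetical helper written for this illustration, not part of spaCy or Prodigy, and other spaCy versions may store pipeline settings elsewhere.

```python
import json
import shutil
from pathlib import Path


def remove_component(model_dir, name):
    """Delete one pipeline component from a saved model directory.

    Assumes a spaCy v2-style layout: <model_dir>/meta.json with a
    "pipeline" list, plus one subdirectory per component.
    """
    model_path = Path(model_dir)
    meta_path = model_path / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))

    # Drop the component from the pipeline listing, if it is there.
    meta["pipeline"] = [p for p in meta.get("pipeline", []) if p != name]
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")

    # Remove the component's data directory, if present.
    component_dir = model_path / name
    if component_dir.is_dir():
        shutil.rmtree(component_dir)
    return meta["pipeline"]
```

Calling remove_component("./en_dep_web_lg", "ner") on a copy of the model directory would then leave the other components untouched. It's worth working on a copy, since the deletion can't be undone.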
Thanks! For completely training a new NER model, then, does this workflow make sense?
- create and train a separate dataset for each entity.
- merge these datasets into a ‘silver’ dataset that combines all the annotations.
- run make-gold on the dataset (required for some reason)
- batch train on the gold dataset
I’m just concerned that this would be slow going, because in the silver and gold recipes the model wasn’t really making any recommendations and I had to add most of them by hand. Is there something I’m missing?
@oneextrafact That does sound roughly right. You may not need to run ner.make-gold after merging, although I do think it’s a good idea to do a review of the dataset, especially since you might get conflicts (the same text annotated with different types).
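To give a rough idea of how the merge and the conflict check could be scripted, here is a small sketch. It assumes each example is a dict with a "text" and a list of "spans" carrying "start", "end" and "label", and that every example you pass in is one you accepted; merge_annotations is a hypothetical helper, not a Prodigy recipe. It only flags spans with identical offsets and different labels, so partially overlapping spans would still need a manual look.

```python
from collections import defaultdict


def merge_annotations(datasets):
    """Combine spans per text and report spans with conflicting labels.

    datasets: list of datasets, each a list of {"text": ..., "spans": [...]}.
    Returns (merged, conflicts): merged maps text -> unambiguous spans,
    conflicts lists (text, start, end, labels) for disputed spans.
    """
    labels_by_span = defaultdict(set)  # (text, start, end) -> labels seen
    merged = {}
    for examples in datasets:
        for eg in examples:
            merged.setdefault(eg["text"], [])
            for span in eg.get("spans", []):
                key = (eg["text"], span["start"], span["end"])
                labels_by_span[key].add(span["label"])

    conflicts = []
    for (text, start, end), labels in sorted(labels_by_span.items()):
        if len(labels) > 1:
            # Same offsets, different entity types: needs a human decision.
            conflicts.append((text, start, end, sorted(labels)))
        else:
            label = next(iter(labels))
            merged[text].append({"start": start, "end": end, "label": label})
    return merged, conflicts
```

Anything that lands in conflicts is a good candidate for the manual review pass before you train.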
You might find the answer here helpful: Work Flow for extending an NER model with new entity types
Thanks for the advice! One last question - when I evaluate the model, I just get an absolute number for correct and incorrect entities. Is there some easy way to generate a confusion matrix, so that I can figure out where I need to direct my efforts for training?
We don’t have that implemented yet unfortunately. It’s something we plan to add, but for now you’d have to code it up yourself.
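If you want to code it up yourself, one possible starting point is below. It's only a sketch under a few assumptions: entities are compared by exact character offsets, spans are given as (start, end, label) tuples for a single text, and "NONE" is a made-up placeholder for entities that were missed or predicted with no gold counterpart. entity_confusion is a hypothetical helper, not something Prodigy or spaCy provides.

```python
from collections import Counter

MISSING = "NONE"  # placeholder label for "no entity at these offsets"


def entity_confusion(gold_spans, pred_spans):
    """Count (gold_label, predicted_label) pairs for one text.

    Spans are (start, end, label) tuples and are matched on exact offsets.
    A gold span with no prediction counts as (label, "NONE"); a prediction
    with no gold span counts as ("NONE", label).
    """
    gold = {(start, end): label for start, end, label in gold_spans}
    pred = {(start, end): label for start, end, label in pred_spans}
    counts = Counter()
    for offsets in gold.keys() | pred.keys():
        pair = (gold.get(offsets, MISSING), pred.get(offsets, MISSING))
        counts[pair] += 1
    return counts
```

For the predictions, you could build the tuples from a processed doc as (ent.start_char, ent.end_char, ent.label_) for each ent in doc.ents, then add up the counters over your evaluation examples. Large off-diagonal counts, like many gold ORG spans predicted as PRODUCT, would show which labels need more annotations.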