How do I train a custom NER model?

oneextrafact · June 20, 2019, 2:57pm

I can’t use the NER provided with spaCy’s out-of-the-box models because my underlying data is too different from those models’ training data. Entities like LAW, EVENT, and WORK_OF_ART are almost nonexistent, for example. But to use ner.batch-train I still need to supply it with a model. Is the problem that I can only use ner.batch-train to improve / update weights in an existing model, and I need to create the initial model outside of Prodigy?

ines · June 20, 2019, 3:20pm

Hi! The model you pass into recipes like ner.batch-train doesn’t have to be pre-trained – it just needs to be a base model that includes the language data to use and the tokenization rules etc. This also makes it easier to start off with custom tokenization rules (which can often be very useful for specific domains), or with any other language supported by spaCy that we don’t ship pre-trained models for yet.

To create a blank base model, you can do the following:

import spacy
nlp = spacy.blank("en")  # or any other language
nlp.to_disk("/path/to/model")

Or, as a handy one-liner:

python -c "import spacy; spacy.blank('en').to_disk('/path/to/model')"

You can then load /path/to/model in as the base model when you call ner.batch-train.

The latest version of Prodigy also supports a shortcut for this and lets you pass in blank:en to start off with a blank model of a given language.

oneextrafact · June 20, 2019, 8:40pm

Thanks! blank:en seems to work nicely. With a blank model, though, is there a way to still have it do dependency parsing and POS tagging? It looks like all it can do is tokenization.

honnibal · June 20, 2019, 9:47pm

Try:


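import spacy
# Load the full pretrained pipeline, including its entity recognizer.
nlp = spacy.load("en_core_web_lg")
# Take the pretrained "ner" component out of the active pipeline.
nlp.disable_pipes("ner")
# Save the remaining components as a new base model directory.
nlp.to_disk("./en_dep_web_lg")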

This should save a model directory that has the original components, minus the ner. You can then use that model directory with Prodigy.

You can also manipulate the model directories directly, if that ends up being more convenient. Just remove the directories you don’t want, and update the meta.json file.
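As a rough sketch of that manual route: the helper below is hypothetical (remove_component is not a spaCy or Prodigy function) and assumes a saved model directory with one subdirectory per component and a meta.json containing a "pipeline" list. Check your own model directory before relying on it.

```python
import json
import shutil
from pathlib import Path


def remove_component(model_dir, name):
    """Hypothetical helper: drop one component from a saved model directory.

    Assumes meta.json has a "pipeline" list and that each component
    lives in a subdirectory named after it.
    """
    model_path = Path(model_dir)
    meta_path = model_path / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))

    # Remove the component from the pipeline list so it is no longer loaded.
    meta["pipeline"] = [p for p in meta.get("pipeline", []) if p != name]
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")

    # Delete the component's own directory, if there is one.
    component_dir = model_path / name
    if component_dir.is_dir():
        shutil.rmtree(component_dir)
    return meta["pipeline"]
```

Calling remove_component("./en_dep_web_lg", "ner") on a copy of the model directory would leave the other components in place. Work on a copy, since the deletion is not reversible.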

oneextrafact · June 21, 2019, 8:54pm

Thanks! For completely training a new NER model, then, does this workflow make sense?

1. Create and train a separate dataset for each entity.
2. Merge these datasets into a ‘silver’ dataset that combines all the annotations.
3. Run ner.make-gold on the dataset (required for some reason).
4. Batch train on the gold dataset.

I’m just concerned that this would be slow going, because in the silver and gold recipes the model wasn’t really making any recommendations and I had to add most of them by hand. Is there something I’m missing?

honnibal · June 24, 2019, 10:17pm

@oneextrafact That does sound roughly right. You may not need to run ner.make-gold after merging, although I do think it’s a good idea to do a review of the dataset, especially since you might get conflicts (the same text annotated with different types).

You might find the answer here helpful: Work Flow for extending an NER model with new entity types
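To make the conflict check concrete, here is a minimal sketch in plain Python. It assumes each dataset has been exported and loaded as a list of dicts with a "text" key and a "spans" list of {"start", "end", "label"} entries; merge_annotations is a hypothetical helper, not a Prodigy recipe.

```python
from collections import defaultdict


def merge_annotations(datasets):
    """Hypothetical helper: merge per-label datasets and list label conflicts.

    datasets is a list of datasets; each dataset is a list of dicts
    with "text" and an optional "spans" list.
    """
    # text -> {(start, end): set of labels seen for that span}
    spans_by_text = defaultdict(dict)
    for examples in datasets:
        for eg in examples:
            text_spans = spans_by_text[eg["text"]]  # keeps texts without spans
            for span in eg.get("spans", []):
                key = (span["start"], span["end"])
                text_spans.setdefault(key, set()).add(span["label"])

    merged, conflicts = [], []
    for text, spans in spans_by_text.items():
        clean = []
        for (start, end), labels in sorted(spans.items()):
            if len(labels) > 1:
                # Same offsets annotated with different types: needs review.
                conflicts.append(
                    {"text": text, "start": start, "end": end, "labels": sorted(labels)}
                )
            else:
                clean.append({"start": start, "end": end, "label": next(iter(labels))})
        merged.append({"text": text, "spans": clean})
    return merged, conflicts
```

This only flags spans with identical offsets and different labels. Partially overlapping spans would pass through unnoticed, so a manual review of the merged data is still worthwhile.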

oneextrafact · June 25, 2019, 4:31pm

Thanks for the advice! One last question - when I evaluate the model, I just get an absolute number for correct and incorrect entities. Is there some easy way to generate a confusion matrix, so that I can figure out where I need to direct my efforts for training?

honnibal · June 25, 2019, 7:02pm

We don’t have that implemented yet unfortunately. It’s something we plan to add, but for now you’d have to code it up yourself.
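If you do code it up yourself, one possible starting point is the sketch below. It is plain Python rather than anything built into Prodigy or spaCy, and it assumes you have already collected gold and predicted entities per text as (start, end, label) tuples. The function names and the "O" placeholder for a missing span are illustrative choices.

```python
from collections import Counter


def entity_confusion(gold, predicted, missing="O"):
    """Count (gold_label, predicted_label) pairs for one text.

    gold and predicted are lists of (start, end, label) tuples. Spans are
    matched on exact offsets; a span found on only one side is paired
    with the `missing` label.
    """
    gold_by_span = {(start, end): label for start, end, label in gold}
    pred_by_span = {(start, end): label for start, end, label in predicted}
    counts = Counter()
    for span in gold_by_span.keys() | pred_by_span.keys():
        pair = (gold_by_span.get(span, missing), pred_by_span.get(span, missing))
        counts[pair] += 1
    return counts


def print_matrix(counts):
    """Print the counts as a table: rows are gold labels, columns are predictions."""
    labels = sorted({label for pair in counts for label in pair})
    print("gold/pred".ljust(12) + "".join(label.ljust(12) for label in labels))
    for gold_label in labels:
        row = "".join(str(counts[(gold_label, p)]).ljust(12) for p in labels)
        print(gold_label.ljust(12) + row)
```

Since the result is a Counter, the counts from many texts can be added together with + before printing. Exact-offset matching is the simplest choice here; boundary errors then show up as one missed span plus one spurious span instead of a label confusion.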
