Initializing a custom model for NER

I am working on training a new entity type and have been following the demonstration videos. I'm currently trying to use word embeddings created with FastText in Gensim. I initialized my model like this:

!python3 -m spacy init-model en /tmp/vectors --vectors-loc dispo_vectors.txt
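
For context, dispo_vectors.txt is in the plain word2vec text format that init-model expects. It was exported from Gensim roughly like this (the model filename here is just a placeholder):

from gensim.models import FastText

# load the trained FastText model and write its vectors out in the
# plain word2vec text format that `spacy init-model` can read
model = FastText.load('dispo_fasttext.model')  # placeholder filename
model.wv.save_word2vec_format('dispo_vectors.txt')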

Then I used Prodigy to generate a terminology list, which I converted to a set of patterns stored as a JSONL file. I want to train the model on my text summaries, which are saved in plain-text (txt) format. Here is the code that I am running:

!prodigy ner.teach opioids_ner /tmp/vectors text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl
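
The patterns file has one JSON object per line, roughly like this (these two terms are just examples):

{"label": "OPIOIDS", "pattern": [{"lower": "oxycodone"}]}
{"label": "OPIOIDS", "pattern": [{"lower": "hydrocodone"}]}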

This produces the following error:

KeyError: "[E001] No component 'ner' found in pipeline. Available names: ['sentencizer']"

I understand that this is due to the pipeline not having an ner component. I found that the fix is to add it, which I did using the following:

nlp = spacy.load('/tmp/vectors')
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.to_disk('unigram-empty_ner')

This creates a new directory called unigram-empty_ner, containing a meta.json file and two subdirectories, ner and vocab. I assumed that I could now just load the model using something like:

vectors_ner_added = spacy.load('unigram-empty_ner')

And then replace the original model (/tmp/vectors) with vectors_ner_added:

!prodigy ner.teach opioids_ner vectors_ner_added text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl

But, obviously, that doesn't work, since vectors_ner_added is a Python variable, not a model name or path. Any guidance would be greatly appreciated.

Thanks

Hi! In theory, you can just add the component exactly like you did, but you also need to call nlp.begin_training() to initialize the component's weights before saving the output to a directory (in spaCy v3, this would be nlp.initialize()). Prodigy supports loading a model from a package name or from a path, so you don't need to call spacy.load in Python: you can just pass your directory path unigram-empty_ner as the spaCy model when you start Prodigy. (If you're in a notebook, this is maybe a bit less intuitive, but keep in mind that the prodigy commands are CLI commands, not Python code, so they're executed in a different context than your Python code.)
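
Putting it together, something along these lines should work, assuming spaCy v2.x (which matches the init-model and create_pipe usage above):

import spacy

# load the vectors-only model, add a blank ner component,
# initialize its weights and save everything to a directory
nlp = spacy.load('/tmp/vectors')
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.begin_training()
nlp.to_disk('unigram-empty_ner')

And then, from the command line (or with ! in a notebook):

!prodigy ner.teach opioids_ner unigram-empty_ner text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl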

Btw, since the drug prediction video is a bit older and Prodigy now has a bunch of additional workflows: instead of doing a "cold start" with ner.teach (using a model that doesn't know anything and trying to teach it enough with patterns so it can make suggestions), you could also try collecting a small dataset of semi-manual annotations using ner.manual plus your patterns. This can give you a more reliable start, because you can make sure that the model sees enough examples of your entity type, and enough texts where it sees the correct answer for every single token.
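
For example, something like this (the dataset name opioids_manual is just a placeholder, and you can reuse the model directory and patterns file from above):

!prodigy ner.manual opioids_manual unigram-empty_ner text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl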