I am working on training a new entity type and have been following the demonstration videos. My current effort is to use word embeddings created using FastText in Gensim. I initialized my model as such:
!python3 -m spacy init-model en /tmp/vectors --vectors-loc dispo_vectors.txt
Then I used Prodigy to generate a terminology list, which I converted to a set of patterns stored as a jsonl file. I want to train the model on my text_summaries, which are saved in plain text (txt) format. Here is the command that I am running:
!prodigy ner.teach opioids_ner /tmp/vectors text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl
This produces the following error:
KeyError: "[E001] No component 'ner' found in pipeline. Available names: ['sentencizer']"
I understand that this is because the pipeline has no ner component. I found that the fix is to add ner to the pipeline, which I did using the following:
import spacy
# Load the vectors-only model, add a blank ner component, and save the result
nlp = spacy.load('/tmp/vectors')
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.to_disk('unigram-empty_ner')
This creates a new directory called unigram-empty_ner, with a meta.json file and two subdirectories, ner and vocab. I assumed that I could now just load the model using something like:
vectors_ner_added = spacy.load('unigram-empty_ner')
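As a sanity check on the saved directory, here is a minimal sketch that reads meta.json and returns the component names it lists. It assumes meta.json has a "pipeline" key, which is what I see in the file spaCy wrote; the helper name saved_pipeline is just something I made up for illustration.

```python
import json
from pathlib import Path


def saved_pipeline(model_dir):
    """Return the component names listed in a saved model's meta.json."""
    meta_path = Path(model_dir) / "meta.json"
    with meta_path.open(encoding="utf8") as f:
        meta = json.load(f)
    # Assumption: the component names are stored under the "pipeline" key
    return meta.get("pipeline", [])


if __name__ == "__main__":
    model_dir = "unigram-empty_ner"  # directory written by nlp.to_disk above
    if Path(model_dir).exists():
        print(saved_pipeline(model_dir))
```

If the list includes ner, the component was at least recorded when the model was saved.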
And then replace the original model (/tmp/vectors) with vectors_ner_added:
!prodigy ner.teach opioids_ner vectors_ner_added text_summaries.txt --loader txt --label OPIOIDS --patterns opioid_patterns.jsonl
But, obviously, that doesn't work because vectors_ner_added is a directory. Any guidance would be greatly appreciated.
Thanks