I’m working with English documents and am trying to build a model better suited to this domain-specific corpus. After running the terms.train-vectors recipe I have 35484257 effective words, which should be enough. I’ve trained it using en_core_web_sm as a base in order to retain the tagger and parser pipes.
Now, when I try to run the ner.teach command, the model doesn’t seem to find any of my patterns. Even when the words I’ve created patterns for appear directly in the document, it doesn’t match any of them and instead proposes punctuation and whitespace as entities.
It seems like I’m doing something wrong here. Or should I keep going while it’s tagging whitespace? Not all documents in my corpus contain this entity, but based on my understanding the patterns should help find the documents that do? Note that while I did use the en_core_web_sm model, I removed the included ner pipe with the goal of training my own.
Thanks for a push in the right direction!
Edit: I tried this again with en_core_web_lg without my own vectors, and it does find the patterns and works as expected. But I’d think that not having other entities and using my own vectors should improve accuracy? Maybe there’s something wrong with my model? I’m creating the model with:
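import spacy
# Load the base model without its pretrained NER component
nlp = spacy.load('en_core_web_lg', disable=["ner"])
# Add a blank NER pipe, initialize weights, and save the model to disk
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.begin_training()
nlp.to_disk('./tmp/model')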
and then running
prodigy ner.teach material_ner ./tmp/model/ ./pages_all.jsonl --patterns ./material_patterns.jsonl
With a patterns file that looks like:
{"label":"MATERIAL","pattern":[{"lower":"cement"}]}
{"label":"MATERIAL","pattern":[{"lower":"concrete"}]}
{"label":"MATERIAL","pattern":[{"lower":"plywood"}]}
{"label":"MATERIAL","pattern":[{"lower":"tiles"}]}
{"label":"MATERIAL","pattern":[{"lower":"steel"}]}
{"label":"MATERIAL","pattern":[{"lower":"aluminum"}]}
Edit 2: I’ve been able to get this to work when I don’t replace the ner pipe with my own. I’m using en_core_web_sm with my own vectors and can successfully train. How would I properly get this to work with a blank ner pipe?