I’m working with English documents and am trying to build a model better suited to this domain-specific corpus. After running the terms.train-vectors recipe I have 35484257 effective words, which should be enough. I trained it using en_core_web_sm as a base in order to retain the tagger and parser pipes.
Now, when I try to run the ner.teach command, the model doesn’t seem to find any of my patterns. Even when the words I’ve created patterns for appear directly in the document, it doesn’t match any of them and instead proposes punctuation and whitespace as entities.
Am I doing something wrong here, or should I keep going even though it’s tagging whitespace? Not all documents in my corpus contain this entity, but my understanding is that the patterns should help find the documents that do. Note that while I did use the en_core_web_sm model, I removed the included ner pipe with the goal of training my own.
Thanks for a push in the right direction!
Edit: I tried this again with en_core_web_lg without my own vectors, and it does find the patterns and works as expected. But I’d think that not having other entities and using my own vectors should improve accuracy? Maybe there’s something wrong with my model? I’m creating the model like this:
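import spacy
# Load the base model without its pretrained ner pipe
nlp = spacy.load('en_core_web_lg', disable=["ner"])
# Add a new, blank ner pipe in its place
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.begin_training()  # initialize the model weights
nlp.to_disk('./tmp/model')  # save so Prodigy can load it from this path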
and then running
prodigy ner.teach material_ner ./tmp/model/ ./pages_all.jsonl --patterns ./material_patterns.jsonl
With a patterns file that looks like:
{"label":"MATERIAL","pattern":[{"lower":"cement"}]}
{"label":"MATERIAL","pattern":[{"lower":"concrete"}]}
{"label":"MATERIAL","pattern":[{"lower":"plywood"}]}
{"label":"MATERIAL","pattern":[{"lower":"tiles"}]}
{"label":"MATERIAL","pattern":[{"lower":"steel"}]}
{"label":"MATERIAL","pattern":[{"lower":"aluminum"}]}
Edit 2: I’ve been able to get this to work when I don’t replace the ner pipe with my own. I’m using en_core_web_sm with my own vectors and can successfully train. How would I properly get this to work with a blank ner pipe?