Using terms.train-vectors recipe with NER

I’m working with English documents and am trying to build a model better suited to this domain-specific corpus. After running the terms.train-vectors recipe I have 35,484,257 effective words, which should be enough. I’ve trained it using en_core_web_sm as a base in order to retain the tagger and parser pipes.

Now, when I try to run the ner.teach command, the model doesn’t seem to find any of my patterns. Even when the words I’ve created patterns for appear directly in the document, it doesn’t match any of them and instead focuses on punctuation and whitespace as the proposed entities.

It seems like I’m doing something wrong here, or should I keep going while it’s tagging whitespace? Not all documents in my corpus have this entity, but my understanding is that the patterns should help find the documents that do. Note that while I did use the en_core_web_sm model, I removed the included ner pipe with the goal of training my own.

Thanks for a push in the right direction!

Edit: I tried this again with en_core_web_lg without my own vectors, and it does find the patterns and works as expected. But I’d think that not having other entities and using my own vectors should improve accuracy? Maybe there’s something wrong with my model? I’m creating the model like this:

import spacy

nlp = spacy.load('en_core_web_lg', disable=["ner"])
nlp.add_pipe(nlp.create_pipe('ner'))  # add a blank NER component
nlp.begin_training()                  # initialize its weights
nlp.to_disk('./tmp/model')

and then running

prodigy ner.teach material_ner ./tmp/model/ ./pages_all.jsonl --patterns ./material_patterns.jsonl

With a patterns file that looks like:

{"label":"MATERIAL","pattern":[{"lower":"cement"}]}
{"label":"MATERIAL","pattern":[{"lower":"concrete"}]}
{"label":"MATERIAL","pattern":[{"lower":"plywood"}]}
{"label":"MATERIAL","pattern":[{"lower":"tiles"}]}
{"label":"MATERIAL","pattern":[{"lower":"steel"}]}
{"label":"MATERIAL","pattern":[{"lower":"aluminum"}]}

Edit 2: I’ve been able to get this to work when I don’t replace the ner pipe with my own. I’m using en_core_web_sm with my own vectors and can successfully train. How would I properly get this to work with a blank ner pipe?

Thanks for the detailed report. I’m not 100% sure, but I don’t think you’re doing anything wrong. Instead, I think the behavior of the ner.teach recipe could be improved here.

When adding new entity types for my own experiments, I’ve mostly been starting from the pre-trained models. However, it’s definitely important to be able to start from a blank model! So this is a gap in our QA, and it looks like we need better heuristics in the ner.teach recipe to support this.

Here’s what I think is going on.

When you use the --patterns flag, the recipe interleaves questions from two models: the PatternMatcher and the EntityRecognizer. The --label argument tells the models to filter out questions that don’t have the label you’re interested in. The prefer_uncertain function then sorts the stream of scored questions so that you’re asked the ones whose score is close to 0.5.
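Roughly, the relevant part of the recipe works like the sketch below. This is only an approximation based on the published recipe code, so the exact module paths and arguments may differ between Prodigy versions; the file paths are just the ones from your command.

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer
from prodigy.models.matcher import PatternMatcher
from prodigy.util import combine_models

nlp = spacy.load('./tmp/model')
stream = JSONL('./pages_all.jsonl')

# The statistical model scores possible entity analyses for the given label,
# while the matcher scores exact matches from the patterns file.
model = EntityRecognizer(nlp, label=['MATERIAL'])
matcher = PatternMatcher(nlp).from_disk('./material_patterns.jsonl')

# combine_models interleaves suggestions from both models; prefer_uncertain
# then sorts the scored stream so the questions you see have scores near 0.5.
predict, update = combine_models(model, matcher)
stream = prefer_uncertain(predict(stream))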

When you run ner.teach with a pre-trained model, it’s pretty confident in its predictions, so its questions about MATERIAL-labelled entities will have scores very close to 0.0. The first questions you receive will therefore come from the PatternMatcher. As you update the EntityRecognizer, the scores for MATERIAL entities will increase, and you’ll start to be asked some of its questions too. Those initial questions will be pretty off-base, but your corrections and the existing weights mean it learns the category surprisingly quickly.

When starting with a blank model, the score distribution in the entity recognizer is pretty uniform. This means you’re asked a lot of nonsense questions. If you click through all of these the model should eventually learn. However, this workflow can obviously be improved.

Two suggestions:

  1. An easy thing that will help a bit, without solving the underlying issue: pre-process your text to remove the extra whitespace (see the tip "Preprocessing text (whitespace, unicode) with textacy"). A small sketch is included after this list.

  2. You might try writing rules which exclude a phrase from being an entity. You can then use these to automatically answer those implausible questions, making bootstrapping go faster. There’s code for this in the thread "patterns using regex or shape". This “anti-pattern” workflow has been suggested as a feature request. A sketch of the idea also follows below.
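For the whitespace preprocessing, something along these lines should work. It’s a minimal sketch using plain regular expressions rather than textacy’s helpers, and the file names (pages_all.jsonl, pages_clean.jsonl) are just assumptions based on your command:

import json
import re

def normalize_whitespace(text):
    # Collapse runs of spaces/tabs and blank lines so the model isn't
    # tempted to propose whitespace as entities.
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n\s*\n+', '\n', text)
    return text.strip()

with open('./pages_all.jsonl') as f_in, open('./pages_clean.jsonl', 'w') as f_out:
    for line in f_in:
        task = json.loads(line)
        task['text'] = normalize_whitespace(task['text'])
        f_out.write(json.dumps(task) + '\n')

For the “anti-pattern” idea, a hypothetical stream filter in a custom recipe could auto-reject suggestions whose spans are only whitespace or punctuation instead of showing them to you. This isn’t a built-in Prodigy feature; the functions below are just a sketch of the approach:

PUNCT = set('.,;:!?-()[]{}"\'')

def is_implausible(span_text):
    # True if the span consists only of whitespace and/or punctuation.
    return all(ch.isspace() or ch in PUNCT for ch in span_text)

def filter_implausible(stream, update=None):
    # Automatically reject suggestions whose spans are all implausible,
    # optionally feeding them straight to the model's update callback
    # instead of asking the annotator about them.
    for eg in stream:
        spans = eg.get('spans', [])
        texts = [eg['text'][s['start']:s['end']] for s in spans]
        if spans and all(is_implausible(t) for t in texts):
            eg['answer'] = 'reject'
            if update is not None:
                update([eg])
        else:
            yield eg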