Using terms.train-vectors recipe with NER

Thanks for the detailed report. I'm not 100% sure, but I don't think you're doing anything wrong. Instead, I think the behavior of the ner.teach recipe could be improved here.

When adding new entity types for my own experiments, I’ve mostly been starting from the pre-trained models. However, it’s definitely important to be able to start from a blank model! So this is a gap in our QA, and it looks like we need better heuristics in the ner.teach recipe to support this.

Here’s what I think is going on.

When you use the --patterns flag, the recipe interleaves questions from two models: the PatternMatcher and the EntityRecognizer. The --label argument tells both models to filter out questions that don't have the label you're interested in. The prefer_uncertain function then sorts the stream of scored questions so that you're asked the ones whose scores are closest to 0.5.
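For reference, here's a simplified sketch of that logic, loosely based on the open-source prodigy-recipes version of ner.teach. The exact imports and signatures may differ across Prodigy versions, and the file names and model name are just placeholders:

```python
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.ner import EntityRecognizer
from prodigy.util import combine_models

nlp = spacy.load("en_core_web_sm")          # or spacy.blank("en") for a blank model
model = EntityRecognizer(nlp, label=["MATERIAL"])

# With --patterns, the matcher's questions are interleaved with the model's
matcher = PatternMatcher(model.nlp).from_disk("patterns.jsonl")
predict, update = combine_models(model, matcher)

# Score each candidate and sort the stream so scores near 0.5 come first
stream = JSONL("my_data.jsonl")
stream = prefer_uncertain(predict(stream))
```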

When you run ner.teach with a pre-trained model, it's pretty confident in its predictions, so its questions about MATERIAL-labelled entities will be very close to 0.0 in score. The first questions you receive will therefore be from the PatternMatcher. As you update the EntityRecognizer, the scores for MATERIAL entities will increase, and you'll start to be asked some of those questions. Those initial questions will be pretty off-base, but your corrections and the existing weights mean it learns the category surprisingly quickly.
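To make the sorting concrete: uncertainty sampling prefers scores close to 0.5. Conceptually it's something like this (just an illustration, not Prodigy's actual implementation):

```python
def uncertainty(score: float) -> float:
    # 1.0 when the model is maximally unsure (score == 0.5),
    # 0.0 when it is fully confident (score near 0.0 or 1.0)
    return 1.0 - abs(score - 0.5) * 2.0

# A pre-trained model's confident "not MATERIAL" is rarely asked first...
print(uncertainty(0.02))  # ≈ 0.04
# ...while a borderline prediction jumps the queue
print(uncertainty(0.48))  # ≈ 0.96
```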

When starting with a blank model, the score distribution in the entity recognizer is pretty uniform, so the uncertainty sorting has almost nothing to go on. This means you're asked a lot of nonsense questions. If you click through all of these, the model should eventually learn. However, this workflow can obviously be improved.

Two suggestions:

  1. An easy thing that will help a bit, without solving the underlying issue: pre-process your text to normalize the whitespace. See this tip: Preprocessing text (whitespace, unicode) with textacy. There's a minimal sketch after this list.

  2. You might try writing rules that exclude a phrase from being an entity. You can then use these to automatically answer those implausible questions, making bootstrapping go faster. There's code for this in this thread: patterns using regex or shape. This “anti-pattern” workflow has been suggested as a feature request; a rough sketch of the idea follows below.
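For the whitespace preprocessing, textacy ships helpers for this (the exact function name has moved between versions), but a dependency-free sketch is enough to show the idea:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and blank lines so tokenization
    doesn't produce whitespace-only tokens the model gets asked about."""
    text = re.sub(r"[ \t]+", " ", text)   # collapse horizontal whitespace
    text = re.sub(r"\n{2,}", "\n", text)  # collapse runs of blank lines
    return text.strip()
```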
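And for the anti-pattern idea, here's a minimal sketch of a stream filter. The ANTI_PATTERNS regexes and the auto_reject function are hypothetical, just to illustrate; the linked thread shows how to express the same thing with token patterns using regex or shape, and a fuller workflow would record these examples as rejected answers rather than simply dropping them:

```python
import re

# Hypothetical anti-patterns: if the highlighted span matches any of
# these, the question is implausible and shouldn't be asked
ANTI_PATTERNS = [
    re.compile(r"^\s*$"),     # whitespace-only spans
    re.compile(r"^[\W_]+$"),  # punctuation-only spans
]

def auto_reject(stream):
    """Skip tasks whose candidate span matches an anti-pattern."""
    for eg in stream:
        spans = [eg["text"][s["start"]:s["end"]] for s in eg.get("spans", [])]
        if any(p.match(t) for p in ANTI_PATTERNS for t in spans):
            continue  # don't show this question to the annotator
        yield eg
```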