Using terms.train-vectors recipe with NER

Thanks for the detailed report. I'm not 100% sure, but I don't think you're doing anything wrong. Instead, I think the behavior of the ner.teach recipe could be improved here.

When adding new entity types for my own experiments, I’ve mostly been starting from the pre-trained models. However, it’s definitely important to be able to start from a blank model! So this is a gap in our QA, and it looks like we need better heuristics in the ner.teach recipe to support this.

Here’s what I think is going on.

When you use the --patterns flag, the recipe interleaves questions from two models: the PatternMatcher and the EntityRecognizer. The --label argument tells both models to filter out questions that don't have the label you're interested in. The prefer_uncertain function then sorts the stream of scored questions so that you're asked the ones whose scores are closest to 0.5.
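For reference, here's a simplified sketch of that logic, loosely based on the open-source prodigy-recipes version of ner.teach. The exact imports and signatures may differ across Prodigy versions, and the file names and model name are just placeholders:

```python
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.ner import EntityRecognizer
from prodigy.util import combine_models

nlp = spacy.load("en_core_web_sm")          # or spacy.blank("en") for a blank model
model = EntityRecognizer(nlp, label=["MATERIAL"])

# With --patterns, the matcher's questions are interleaved with the model's
matcher = PatternMatcher(model.nlp).from_disk("patterns.jsonl")
predict, update = combine_models(model, matcher)

# Score each candidate and sort the stream so scores near 0.5 come first
stream = JSONL("my_data.jsonl")
stream = prefer_uncertain(predict(stream))
```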

When you run ner.teach with a pre-trained model, it's pretty confident in its predictions, so its questions about MATERIAL-labelled entities will be very close to 0.0 in score. The first questions you receive will therefore be from the PatternMatcher. As you update the EntityRecognizer, the scores for MATERIAL entities will increase, and you'll start to be asked some of those questions. Those initial questions will be pretty off-base, but your corrections and the existing weights mean it learns the category surprisingly quickly.
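To make the sorting concrete: uncertainty sampling prefers scores close to 0.5. Conceptually it's something like this (just an illustration, not Prodigy's actual implementation):

```python
def uncertainty(score: float) -> float:
    # 1.0 when the model is maximally unsure (score == 0.5),
    # 0.0 when it is fully confident (score near 0.0 or 1.0)
    return 1.0 - abs(score - 0.5) * 2.0

# A pre-trained model's confident "not MATERIAL" is rarely asked first...
print(uncertainty(0.02))  # ≈ 0.04
# ...while a borderline prediction jumps the queue
print(uncertainty(0.48))  # ≈ 0.96
```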

When starting with a blank model, the score distribution in the entity recognizer is pretty uniform, so the uncertainty sorting has almost nothing to go on. This means you're asked a lot of nonsense questions. If you click through all of these, the model should eventually learn. However, this workflow can obviously be improved.

Two suggestions:

  1. An easy thing that will help a bit, without solving the underlying issue: pre-process your text to normalize the whitespace. See this tip: Preprocessing text (whitespace, unicode) with textacy. There's a minimal sketch after this list.

  2. You might try writing rules that exclude a phrase from being an entity. You can then use these to automatically answer those implausible questions, making bootstrapping go faster. There's code for this in this thread: patterns using regex or shape. This “anti-pattern” workflow has been suggested as a feature request; a rough sketch of the idea follows below.
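For the whitespace preprocessing, textacy ships helpers for this (the exact function name has moved between versions), but a dependency-free sketch is enough to show the idea:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and blank lines so tokenization
    doesn't produce whitespace-only tokens the model gets asked about."""
    text = re.sub(r"[ \t]+", " ", text)   # collapse horizontal whitespace
    text = re.sub(r"\n{2,}", "\n", text)  # collapse runs of blank lines
    return text.strip()
```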
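And for the anti-pattern idea, here's a minimal sketch of a stream filter. The ANTI_PATTERNS regexes and the auto_reject function are hypothetical, just to illustrate; the linked thread shows how to express the same thing with token patterns using regex or shape, and a fuller workflow would record these examples as rejected answers rather than simply dropping them:

```python
import re

# Hypothetical anti-patterns: if the highlighted span matches any of
# these, the question is implausible and shouldn't be asked
ANTI_PATTERNS = [
    re.compile(r"^\s*$"),     # whitespace-only spans
    re.compile(r"^[\W_]+$"),  # punctuation-only spans
]

def auto_reject(stream):
    """Skip tasks whose candidate span matches an anti-pattern."""
    for eg in stream:
        spans = [eg["text"][s["start"]:s["end"]] for s in eg.get("spans", [])]
        if any(p.match(t) for p in ANTI_PATTERNS for t in spans):
            continue  # don't show this question to the annotator
        yield eg
```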