Thanks for the detailed report. I’m not 100% sure, but I don’t think you’re doing anything wrong. Instead, I think the behavior of the `ner.teach` recipe could be improved here.
When adding new entity types for my own experiments, I’ve mostly been starting from the pre-trained models. However, it’s definitely important to be able to start from a blank model! So this is a gap in our QA, and it looks like we need better heuristics in the `ner.teach` recipe to support this.
Here’s what I think is going on.
When you use the `--patterns` flag, the recipe interleaves questions from two models: the `PatternMatcher` and the `EntityRecognizer`. The `--label` argument tells the models to filter out questions that don’t have the label you’re interested in. The `prefer_uncertain` function then sorts the stream of scored questions, so that you’re asked the ones whose scores are close to 0.5.
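To make the sorting step concrete, here’s a minimal, self-contained sketch of uncertainty sorting: questions whose scores sit closest to 0.5 come first. This is illustrative only, not Prodigy’s actual `prefer_uncertain` implementation (which operates lazily on a generator); the names and examples are made up.

```python
def prefer_uncertain(scored_stream):
    """Sort (score, example) pairs by distance from 0.5, closest first.

    Batch simplification for illustration -- the real function works on
    a lazy stream rather than a fully sorted list.
    """
    return [ex for score, ex in sorted(scored_stream, key=lambda p: abs(p[0] - 0.5))]


stream = [
    (0.02, "confident: not an entity"),
    (0.51, "very uncertain"),
    (0.97, "confident: an entity"),
    (0.40, "fairly uncertain"),
]

# The uncertain examples (0.51, 0.40) come first; the confident
# ones (0.97, 0.02) come last.
print(prefer_uncertain(stream))
```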
When you run `ner.teach` with a pre-trained model, it’s pretty confident in its predictions, so its questions about `MATERIAL`-labelled entities will have scores very close to 0.0. The first questions you receive will therefore come from the `PatternMatcher`. As you update the `EntityRecognizer`, the scores for `MATERIAL` entities will increase, and you’ll start to be asked some of its questions. Those initial questions will be pretty off-base, but your corrections and the existing weights mean it learns the category surprisingly quickly.
When starting with a blank model, the score distribution in the entity recognizer is pretty uniform. This means you’re asked a lot of nonsense questions. If you click through all of these the model should eventually learn. However, this workflow can obviously be improved.
Two suggestions:
- An easy thing that will help a bit, without solving the underlying issue: pre-process your text to remove the extra whitespace. See the thread “Tip: Preprocessing text (whitespace, unicode) with textacy”.
- You might try writing rules that exclude a phrase from being an entity. You can then use these to automatically answer the implausible questions, making bootstrapping go faster. There’s code for this in the thread “patterns using regex or shape”. This “anti-pattern” workflow has been suggested as a feature request.
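For the whitespace clean-up in the first suggestion, you don’t strictly need textacy; a plain-Python stand-in (the helper name here is made up) could look like:

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces.

    Hypothetical stand-in for the textacy preprocessing discussed in
    the linked thread.
    """
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("Samples  were \n\n annealed\tat 500 C."))
```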
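For the second suggestion, one way to picture the “anti-pattern” idea is a pre-filter that auto-rejects candidate spans matching an exclusion rule, so only plausible candidates reach you. This is a rough sketch under my own assumptions, not an existing Prodigy feature; the rules and helper name are illustrative.

```python
import re

# Hypothetical exclusion rules: spans matching any of these are never
# plausible MATERIAL entities, so we answer "reject" automatically.
ANTI_PATTERNS = [
    re.compile(r"^\d+$"),               # bare numbers
    re.compile(r"^(the|a|an)$", re.I),  # lone determiners
]


def auto_answer(span_text):
    """Return 'reject' for spans hit by an anti-pattern, else None.

    None means the question still goes to the human annotator.
    """
    if any(p.search(span_text) for p in ANTI_PATTERNS):
        return "reject"
    return None


print(auto_answer("42"))        # auto-rejected
print(auto_answer("graphene"))  # still shown to the annotator
```

Pre-answering the obviously wrong candidates this way means the blank model gets negative examples for free, which should speed up bootstrapping.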