You're right that using the active learning from a "cold start" isn't very efficient. I think if you're starting a new model, the best way is actually to use the terms.teach recipe and create a word list for the entities you're interested in. Then you can create a rule-based system that starts suggesting candidate entities, which gives you something to say yes/no to. This gives the model something to learn from, so the active learning can get started.
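As a rough illustration of that bootstrapping step, here's a minimal sketch using spaCy's PhraseMatcher. The seed terms and the DRUG label are just made-up examples, and the real terms.teach / pattern workflow handles more of this for you:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Seed word list -- in practice this would come out of terms.teach.
seed_terms = ["aspirin", "ibuprofen", "paracetamol"]  # hypothetical example terms

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DRUG", [nlp.make_doc(term) for term in seed_terms])

def suggest_candidates(texts):
    """Yield candidate entity spans to say yes/no to."""
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            span = doc[start:end]
            yield {"text": doc.text, "start": span.start_char,
                   "end": span.end_char, "label": "DRUG",
                   "span_text": span.text}

for candidate in suggest_candidates(["She was given ibuprofen for the pain."]):
    print(candidate)
```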
You can also start the training by doing manual annotation, with something like BRAT. However, I think there's a similar issue, because you want to make sure you select texts which have a decent number of the entities you're interested in. This means you end up using a rule-based approach to "bootstrap" as well.
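For instance, before doing manual annotation you can run a quick pattern-based filter over the corpus, so you don't spend time on documents with no entities in them at all. A small sketch, again with made-up seed terms:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# Hypothetical seed terms for the entity type you care about.
matcher.add("DRUG", [nlp.make_doc(t) for t in ["aspirin", "ibuprofen"]])

def select_texts(texts, min_matches=2):
    """Keep only texts with a decent number of likely entity mentions."""
    for doc in nlp.pipe(texts):
        if len(matcher(doc)) >= min_matches:
            yield doc.text
```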
We're working on a tutorial and an extra NER recipe, ner.bootstrap, that makes this workflow more explicit. It's working quite well in our testing so far, especially when using more detailed patterns with spaCy's Matcher class.
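To give a sense of what "more detailed patterns" means, here's a small token-based pattern with spaCy's Matcher. The pattern and the DOSE label are just illustrative:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match e.g. "50 mg" or "12.5 ml": a number-like token followed by a unit.
pattern = [{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml", "mcg"]}}]
matcher.add("DOSE", [pattern])

doc = nlp("The patient took 50 mg of ibuprofen.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "50 mg"
```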
You can control this by setting the sorter in the recipe, but by default we use what the literature calls "uncertainty sampling": we pick the examples where the model's score is closest to 0.5. This policy produces the largest expected gradient. There are some tricks to doing this nicely in a streaming setting, while keeping the application responsive. Sometimes it's good to bias the sampling towards predictions of "True", because annotations marked "accept" can be used more directly. If we answer "reject", we don't come away knowing the correct annotation, just that the model was wrong.
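Conceptually, the sorting looks something like the sketch below. This isn't Prodigy's implementation (which also has to deal with the streaming and responsiveness issues), just a toy version of biased uncertainty sampling over (score, example) pairs:

```python
def biased_uncertainty(score, bias=0.5):
    """Higher is better. bias=0.5 is plain uncertainty sampling (peak at
    score 0.5); pushing bias towards 1.0 favours examples the model thinks
    are "True", so more answers can be "accept"."""
    return 1.0 - abs(score - bias)

def sort_stream(scored_examples, bias=0.5, batch_size=10):
    """Toy batch-wise sorter: score a small batch at a time and yield it
    best-first, so the queue stays responsive instead of waiting to score
    the whole stream up front."""
    batch = []
    for score, example in scored_examples:
        batch.append((biased_uncertainty(score, bias), example))
        if len(batch) >= batch_size:
            for _, ex in sorted(batch, key=lambda x: x[0], reverse=True):
                yield ex
            batch = []
    for _, ex in sorted(batch, key=lambda x: x[0], reverse=True):
        yield ex
```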
The uncertainty sampling is done by the function prefer_uncertain. The bias argument lets you shift towards predictions closer to 1.0 or 0.0. By default, the ner.teach recipe sets a bias of 0.8.
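In a custom recipe, using it looks roughly like this. I'm writing the import path and keyword from memory, so double-check against the docs for your version:

```python
from prodigy.components.sorters import prefer_uncertain

def scored_stream():
    # Stand-in for the model scoring incoming examples: the sorters
    # expect a stream of (score, example) tuples, as in ner.teach.
    yield 0.93, {"text": "She took ibuprofen.", "label": "DRUG"}
    yield 0.48, {"text": "The meeting is on Tuesday.", "label": "DRUG"}

# bias=0.8 shifts the sampling towards predictions closer to 1.0,
# so more of the questions can be answered "accept".
stream = prefer_uncertain(scored_stream(), bias=0.8)
for example in stream:
    print(example)
```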