You're right that using the active learning from a "cold start" isn't very efficient. I think if you're starting a new model, the best way is actually to use the terms.teach recipe and create a word list for the entities you're interested in. Then you can create a rule-based system that starts suggesting candidate entities, which gives you something to say yes/no to. This gives the model something to learn from, so the active learning can get started.
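As a rough illustration of that bootstrapping step, here's a minimal sketch using spaCy's PhraseMatcher. The seed terms and the DRUG label are just made-up examples, and the real terms.teach / pattern workflow handles more of this for you:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Seed word list -- in practice this would come out of terms.teach.
seed_terms = ["aspirin", "ibuprofen", "paracetamol"]  # hypothetical example terms

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DRUG", [nlp.make_doc(term) for term in seed_terms])

def suggest_candidates(texts):
    """Yield candidate entity spans to say yes/no to."""
    for doc in nlp.pipe(texts):
        for match_id, start, end in matcher(doc):
            span = doc[start:end]
            yield {"text": doc.text, "start": span.start_char,
                   "end": span.end_char, "label": "DRUG",
                   "span_text": span.text}

for candidate in suggest_candidates(["She was given ibuprofen for the pain."]):
    print(candidate)
```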
You can also start the training by doing manual annotation, with something like BRAT. However, I think there's a similar issue, because you want to make sure you select texts which have a decent number of the entities you're interested in. This means you end up using a rule-based approach to "bootstrap" as well.
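For instance, before doing manual annotation you can run a quick pattern-based filter over the corpus, so you don't spend time on documents with no entities in them at all. A small sketch, again with made-up seed terms:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# Hypothetical seed terms for the entity type you care about.
matcher.add("DRUG", [nlp.make_doc(t) for t in ["aspirin", "ibuprofen"]])

def select_texts(texts, min_matches=2):
    """Keep only texts with a decent number of likely entity mentions."""
    for doc in nlp.pipe(texts):
        if len(matcher(doc)) >= min_matches:
            yield doc.text
```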
We're working on a tutorial and an extra NER recipe, ner.bootstrap, that makes this workflow more explicit. It's working quite well in our testing so far, especially when using more detailed patterns with spaCy's Matcher class.
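To give a sense of what "more detailed patterns" means, here's a small token-based pattern with spaCy's Matcher. The pattern and the DOSE label are just illustrative:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match e.g. "50 mg" or "12.5 ml": a number-like token followed by a unit.
pattern = [{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml", "mcg"]}}]
matcher.add("DOSE", [pattern])

doc = nlp("The patient took 50 mg of ibuprofen.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "50 mg"
```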
You can control this by setting the sorter in the recipe, but by default we use what the literature calls "uncertainty sampling": we pick the examples where the model's score is closest to 0.5. This policy produces the largest expected gradient. There are some tricks to doing this nicely in a streaming setting, while keeping the application responsive. Sometimes it's good to bias the sampling towards predictions of "True", because annotations marked "accept" can be used more directly. If we answer "reject", we don't come away knowing the correct annotation, just that the model was wrong.
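Conceptually, the sorting looks something like the sketch below. This isn't Prodigy's implementation (which also has to deal with the streaming and responsiveness issues), just a toy version of biased uncertainty sampling over (score, example) pairs:

```python
def biased_uncertainty(score, bias=0.5):
    """Higher is better. bias=0.5 is plain uncertainty sampling (peak at
    score 0.5); pushing bias towards 1.0 favours examples the model thinks
    are "True", so more answers can be "accept"."""
    return 1.0 - abs(score - bias)

def sort_stream(scored_examples, bias=0.5, batch_size=10):
    """Toy batch-wise sorter: score a small batch at a time and yield it
    best-first, so the queue stays responsive instead of waiting to score
    the whole stream up front."""
    batch = []
    for score, example in scored_examples:
        batch.append((biased_uncertainty(score, bias), example))
        if len(batch) >= batch_size:
            for _, ex in sorted(batch, key=lambda x: x[0], reverse=True):
                yield ex
            batch = []
    for _, ex in sorted(batch, key=lambda x: x[0], reverse=True):
        yield ex
```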
The uncertainty sampling is done by the function prefer_uncertain. The bias argument lets you shift towards predictions closer to 1.0 or 0.0. By default, the ner.teach recipe sets a bias of 0.8.
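In a custom recipe, using it looks roughly like this. I'm writing the import path and keyword from memory, so double-check against the docs for your version:

```python
from prodigy.components.sorters import prefer_uncertain

def scored_stream():
    # Stand-in for the model scoring incoming examples: the sorters
    # expect a stream of (score, example) tuples, as in ner.teach.
    yield 0.93, {"text": "She took ibuprofen.", "label": "DRUG"}
    yield 0.48, {"text": "The meeting is on Tuesday.", "label": "DRUG"}

# bias=0.8 shifts the sampling towards predictions closer to 1.0,
# so more of the questions can be answered "accept".
stream = prefer_uncertain(scored_stream(), bias=0.8)
for example in stream:
    print(example)
```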