Active learning and its reflection on accuracy

I am training an NER model to extract the names of injured body parts. For example, if I have the text

The employee dropped the hammer from their left hand and it fell on their foot.

I want to annotate “foot”, but not “hand”, as INJURED_BODY_PART. So the model is learning both lexical and contextual information. I am only teaching this new tag and not trying to retain the model’s accuracy on other tags, so I don’t think catastrophic forgetting should be an issue here.
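For concreteness, the target annotation for that sentence would look roughly like this (a sketch in the style of Prodigy’s JSON task format; the exact dict is illustrative):

```python
# Sketch of the desired annotation: only "foot" is marked as
# INJURED_BODY_PART, while "hand" is deliberately left unannotated,
# so the model has to use context, not just the word itself.
text = "The employee dropped the hammer from their left hand and it fell on their foot."
start = text.index("foot")
task = {
    "text": text,
    "spans": [
        {"start": start, "end": start + len("foot"), "label": "INJURED_BODY_PART"}
    ],
}
print(text[task["spans"][0]["start"]:task["spans"][0]["end"]])  # foot
```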

At the end of my annotation session the progress-to-zero-error bar was reading about 90%, and I was accepting most of the suggestions that Prodigy was making. Then I ran ner.batch-train and saw the following.

prodigy ner.batch-train safety_new safety2.model --output safety3.model --label INJURED_BODY_PART

Loaded model safety2.model
Using 50% of accept/reject examples (291) for evaluation
Using 100% of remaining examples (643) for training
Dropout: 0.2 Batch size: 32 Iterations: 10

BEFORE 0.010
Correct 7
Incorrect 722
Entities 2273
Unknown 8


#   Loss   Correct  Incorrect  Entities  Unknown  Accuracy
01  7.707  87       642        1344      0        0.119
02  5.680  91       638        245       0        0.125
03  4.709  127      602        360       0        0.174
04  3.040  160      569        418       0        0.219
05  3.283  165      564        375       0        0.226
06  2.980  159      570        402       0        0.218
07  2.381  170      559        448       0        0.233
08  2.665  170      559        706       0        0.233
09  2.734  169      560        777       0        0.232
10  1.581  164      565        798       0        0.225

Correct 170
Incorrect 559
Baseline 0.010
Accuracy 0.233
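(As a sanity check, the reported Accuracy figure is consistent with Correct / (Correct + Incorrect) from the counts above — and so is the 0.010 baseline, from 7 / (7 + 722).)

```python
# Reproduce batch-train's reported figures from the counts it prints:
# accuracy = Correct / (Correct + Incorrect).
correct, incorrect = 170, 559
accuracy = correct / (correct + incorrect)
print(round(accuracy, 3))  # 0.233

baseline = 7 / (7 + 722)
print(round(baseline, 3))  # 0.01
```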

I would expect the model to perform better on this task. The flat accuracy makes me think that I don’t have enough training data. However, this seems strange, because the active-learning model in the Prodigy UI seemed to be doing extremely well, and I would expect that model to underestimate the true accuracy. Yet the accuracy from ner.batch-train is about 20%, while towards the end of the annotation session I was hitting Accept way more than 20% of the time.

My intuition is that this is strange. If I am hitting Accept way more than 20% of the time during active learning, I should see much better than 20% cross validation accuracy. Is this intuition correct, or is there something I’m overlooking?
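One thing worth keeping in mind: the accept rate only scores the spans the model chose to suggest, while batch-train evaluates against all the entities in the held-out examples. A toy calculation (all numbers below are hypothetical, just to illustrate the gap):

```python
# Hypothetical numbers: a 90% accept rate on the model's own suggestions
# is still compatible with low entity-level accuracy, because accept/reject
# never sees the entities the model failed to propose at all.
suggested = 100   # spans the model proposed during the teach session
accepted = 90     # suggestions the annotator accepted (90% accept rate)
missed = 300      # true entities the model never suggested

accept_rate = accepted / suggested
recall_ceiling = accepted / (accepted + missed)  # best-case recall
print(accept_rate)               # 0.9
print(round(recall_ceiling, 3))  # 0.231
```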

Try fiddling with the hyper-parameters? Experiment with a lower batch size, and try increasing the dropout rate. The loss is decreasing so the model does think it’s learning something.

Also, is safety2.model a model you’d trained previously? Where did you get it from – is it the output of ner.batch-train? You probably want to start from a blank model. en_vectors_web_lg would probably be good: it has word vectors, but no initial NER model. The NER models in the en_core data packs have trained on a task so different from yours that they’re probably not that useful.

You can run ner.train-curve to figure out whether more data is a likely solution. My guess is you’ll see similar (lack of) accuracy with 80% or 60% of your annotations, indicating that 120% or 140% of your annotations won’t make an enormous difference.
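The reasoning behind reading a train-curve can be sketched like this (the accuracy numbers here are made up purely for illustration):

```python
# Hypothetical train-curve results: accuracy after training on increasing
# fractions of the annotations. If the curve is still climbing steeply at
# the right-hand end, more data is likely to help; if it has flattened,
# more annotations alone probably won't move the needle much.
curve = {0.25: 0.18, 0.50: 0.21, 0.75: 0.22, 1.00: 0.23}

fractions = sorted(curve)
last_gain = curve[fractions[-1]] - curve[fractions[-2]]
print(round(last_gain, 2))  # 0.01 -- nearly flat in this made-up example
```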

I think hyper-parameters and initialization are a likelier solution. This stuff is still quite poorly understood in general, but I think if the batch size is large at the start of training, the model becomes quite sensitive to the initialization.

Hi Matthew,

Thanks for your prompt reply! Really appreciate it!

I tried tweaking the hyper-parameters, and my observation was that there wasn’t a drastic difference in the accuracy numbers.
I did find the ner.train-curve method useful. Looking at the results, I noticed that my model needed more sample data. It’s surprising, because I did most of the annotation until I encountered the “No Tasks Available” pop-up.

If I try re-annotating, the Prodigy interface starts from sample 1 again. I was wondering if there is a way to pick the range of examples to annotate, or to annotate the samples starting from the end?

Thank you so much. Appreciate all the help.

To avoid repeating your previous examples, add --exclude safety_new as an argument to ner.teach. This looks at the task hashes and removes questions that match your previous data.

If you want to exclude the whole input, rather than just the exact tasks, you can use the prodigy.components.filters.filter_inputs function:

# Get the input hashes for a dataset, and remove examples that match them.
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs

DB = connect()
input_hashes = DB.get_input_hashes(dataset)
stream = filter_inputs(stream, input_hashes)

Do double check your initialisation, too — if you’re training on top of a model you previously trained, you might be starting from a bad spot.