More annotations worsen the F-score?

luri · January 22, 2021, 11:00am

Hey explosionists,

I'm working on a project to create a dataset and train a model to recognise German industry data (manufacturing steps, materials, etc.) as named entities. The data is provided by one company and the 13 labels are custom for their products.
In short my workflow so far was:

ner.manual: 200 examples
ner.correct: 600 examples
ner.teach: 1200 examples
ner.teach: 800 examples with the labels that occur most rarely

I started training a blank:de model and trained the model after each annotation step to monitor the metrics - always with the model of the previous training-step as input.
The F-score went from 84% after step 1 to 94% after step 2, 91% after step 3 and 82% after step 4. The per-label metrics also seem a bit erratic to me. The train-curve always increased except after step 2.

My question: Am I doing something fundamentally wrong, or do I just have to keep annotating, e.g. by using ner.correct?

I really love working with Prodigy and would be extremely happy for any kind of advice.
Thanks in advance!

ines · January 24, 2021, 3:55am

Love this, haven't heard that one before!

This is definitely interesting! When you trained from the binary annotations in step 3 and 4, did you make sure to set the --binary flag on prodigy train to make sure that they're interpreted as binary yes/no answers and the rejects are taken into account? Also, did you train on the total 2000 binary annotations (1200 + 800) or on the 1200 first and then on the 800? The latter could be problematic because if you're only updating with rare labels, the model might overfit on the new data.

luri · January 25, 2021, 1:29am

Uhh I didn't set the binary flag on

I always trained with all annotations as I keep them in one dataset, so 2800 examples.
Would it be smarter to keep the binary annotations in a separate dataset and train the model separately - one with the binary flag on and the other without? Training the model that was previously trained with the 800 non-binary annotations using the dataset with 2800 examples and the binary flag on didn't lead to a high accuracy (64 %).

I noticed that the metrics for the more common labels are now pretty low and the ones for the rare labels are quite good. That could be a sign of overfitting, couldn't it? The marked labels in the screenshot are the rare ones.

My next step would simply be annotating more examples using ner.correct until the F-score is up again

Thank you so much for your help!
It's super interesting to get behind the logic.

ines · January 25, 2021, 10:59pm

Sorry for the confusion and the parallel workflows – it's certainly not ideal to have the two different types of training and it's something we're looking to unify in the future (and with the upgrade to spaCy v3).

Basically, the idea behind it is this: if you have annotations describing spans of text, there are two ways to update the named entity recognizer from them. One is to assume all unannotated tokens are not entities, and the other is assuming they're missing values. spaCy also allows updating the model with information like "I don't know what this token is but I know that it's not B-PERSON" (if you're interested in how this works in more detail, see my slides here). So this is what setting --binary does.
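To make the two interpretations concrete, here is a minimal sketch in plain Python. It is not how spaCy or Prodigy implement this internally: the function name `spans_to_tags`, the `"-"` placeholder for a missing value, the `MATERIAL` label and the example tokens are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label) - token-based spans for simplicity
Span = Tuple[int, int, str]


def spans_to_tags(n_tokens: int, spans: List[Span], missing_as_outside: bool) -> List[str]:
    """Turn annotated spans into BILUO-style tags.

    missing_as_outside=True:  unannotated tokens become "O" (definitely no entity).
    missing_as_outside=False: unannotated tokens become "-" (value unknown).
    """
    fill = "O" if missing_as_outside else "-"
    tags = [fill] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"L-{label}"  # last token
    return tags


tokens = ["Welle", "aus", "Stahl", "drehen"]
spans = [(2, 3, "MATERIAL")]  # hypothetical label

complete = spans_to_tags(len(tokens), spans, missing_as_outside=True)
partial = spans_to_tags(len(tokens), spans, missing_as_outside=False)
print(complete)  # ['O', 'O', 'U-MATERIAL', 'O']
print(partial)   # ['-', '-', 'U-MATERIAL', '-']
```

With the first reading, every token you didn't highlight is treated as a confirmed non-entity. With the second, only the highlighted span is treated as known, which is the reading that fits binary accept/reject answers about a single suggested span.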

Yes, in general, I'd always recommend using more fine-grained datasets if possible. It's always easy to merge datasets later on (and Prodigy can take care of this for you when you train).
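If everything already sits in one dataset, one possible way to separate it again after exporting is to group the records by the interface they were created with. This is only a sketch under an assumption you should check against your own export: that each record carries a `_view_id` field, with `ner_manual` for ner.manual and ner.correct and `ner` for ner.teach. The helper name `split_by_view` and the sample records are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_view(examples: Iterable[dict], key: str = "_view_id") -> Dict[str, List[dict]]:
    """Group exported annotation records by the value of `key`.

    Records without the key end up in the "unknown" group so nothing is lost.
    """
    groups = defaultdict(list)
    for eg in examples:
        groups[eg.get(key, "unknown")].append(eg)
    return dict(groups)


# Made-up records standing in for lines of an exported JSONL file
examples = [
    {"text": "Welle aus Stahl drehen", "_view_id": "ner_manual", "answer": "accept"},
    {"text": "Gehaeuse fraesen", "_view_id": "ner", "answer": "reject"},
    {"text": "Blech biegen", "_view_id": "ner", "answer": "accept"},
]

groups = split_by_view(examples)
for view_id, records in groups.items():
    print(view_id, len(records))
```

Each group can then be written to its own file and imported as its own dataset, so the binary and the fully annotated examples can be trained and evaluated separately.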

That's definitely interesting and worth keeping an eye on! How are you evaluating? Do you have a dedicated evaluation set? If not, that could be a good next step to work on: you want to make sure you're always evaluating on the same representative dataset so you can compare the results across runs, and properly compare the different approaches.
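For comparing runs label by label on one fixed evaluation set, a small span-level scorer can help. The following is a minimal sketch, not the scorer spaCy or Prodigy use: it counts a predicted span as correct only if start, end and label all match exactly, and the function name `per_label_prf` as well as the labels in the example are assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def per_label_prf(gold: List[Set[Span]], pred: List[Set[Span]]) -> Dict[str, Dict[str, float]]:
    """Exact-match precision, recall and F-score per label.

    gold[i] and pred[i] are the span sets for the same evaluation text.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            if span in gold_spans:
                tp[span[2]] += 1
            else:
                fp[span[2]] += 1
        for span in gold_spans - pred_spans:
            fn[span[2]] += 1
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[label] = {"p": precision, "r": recall, "f": f_score}
    return scores


# Two evaluation texts with hypothetical labels
gold = [{(0, 5, "MATERIAL")}, {(0, 4, "STEP"), (10, 15, "MATERIAL")}]
pred = [{(0, 5, "MATERIAL")}, {(0, 4, "MATERIAL")}]

scores = per_label_prf(gold, pred)
for label, values in scores.items():
    print(label, values)
```

Running the same scorer against the same gold spans after every training run makes it easier to see whether the common labels really dropped while the rare ones improved.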

luri · January 26, 2021, 2:06am

Thank you for your extremely valuable input! Helped me a lot so far.

So far I always evaluated with the entire dataset and the default train-eval splits, but I definitely get the point of a dedicated evaluation set!

I think using ner.teach is not the ideal approach for the type of data I deal with - quite short and consistent technical texts describing a part and the manufacturing steps to build it.

By annotating most texts with ner.teach, I created a strong bias towards certain entities, which led to a dataset that isn't suitable for training a good model. At least that is how I think it might be.

My approach is now ner.correct - as the F-score was very good even at 800 examples - till the train-curve is happy. Afterwards I will add some ner.teach training data in a separate dataset to check the influence.
Does that sound reasonable?

Is there an elegant way to convert a CSV file back to JSONL? I got kinda stuck between pandas and srsly.
I exported the dataset using db-out and split off the 2000 ner.teach examples. I want to use the remaining 800 fully annotated examples and collect some more, so that I have a decent training dataset that I can split the eval data off from.

Thank you so much in advance

luri · January 26, 2021, 8:24pm

Update: I created a separate dataset with 500 annotated examples for evaluation

ines · January 27, 2021, 12:30am

Yes, that sounds good to me

Apparently pandas now has a lines flag on the to_json method, so maybe that helps? I found it in this thread, which looks a lot like someone converting patterns for Prodigy.
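In pandas, that would be `df.to_json(path, orient="records", lines=True)`. If you'd rather avoid the detour, here is a rough standard-library sketch. It rests on assumptions about your CSV that you'd need to check: that nested fields sit in columns named `spans`, `tokens` or `meta`, and that they were stored either as JSON or as a Python literal. The function names `parse_cell` and `csv_to_jsonl` and the sample row are made up for the example.

```python
import ast
import csv
import io
import json
from typing import Sequence, TextIO

# Assumption: these columns hold nested data that was stringified in the CSV
NESTED_COLUMNS = ("spans", "tokens", "meta")


def parse_cell(value: str):
    """Parse a stringified list/dict: try JSON first, then a Python literal."""
    try:
        return json.loads(value)
    except ValueError:
        return ast.literal_eval(value)


def csv_to_jsonl(src: TextIO, dst: TextIO, nested: Sequence[str] = NESTED_COLUMNS) -> int:
    """Write one JSON object per CSV row to dst and return the number of rows."""
    count = 0
    for row in csv.DictReader(src):
        for col in nested:
            if row.get(col):
                row[col] = parse_cell(row[col])
        dst.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# Build a tiny in-memory CSV so the example runs without any files
src = io.StringIO()
writer = csv.writer(src)
writer.writerow(["text", "spans", "answer"])
spans = [{"start": 10, "end": 15, "label": "MATERIAL"}]  # hypothetical label
writer.writerow(["Welle aus Stahl", json.dumps(spans), "accept"])
src.seek(0)

dst = io.StringIO()
n_written = csv_to_jsonl(src, dst)
print(dst.getvalue())
```

For real files, replace the two `io.StringIO` objects with `open("in.csv", newline="", encoding="utf8")` and `open("out.jsonl", "w", encoding="utf8")`. Note that values in other columns stay strings, so numeric fields such as hashes would need converting back as well.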
