More annotations worsen the F-score?

Hey explosionists,

I'm working on a project to create a dataset and train a model to recognise German industry data (manufacturing steps, materials, etc.) as named entities. The data is provided by one company and the 13 labels are custom for their products.
In short my workflow so far was:

  1. ner.manual: 200 examples
  2. ner.correct: 600 examples
  3. ner.teach: 1200 examples
  4. ner.teach: 800 examples with the labels that occur most rarely

I started from a blank:de model and retrained after each annotation step to monitor the metrics - always with the model from the previous training step as input.
The F-score went from 84 % after step 1 to 94 % after step 2, 91 % after step 3 and 82 % after step 4. The per-label metrics also seem a bit erratic to me. The train-curve always increased except after step 2.

My question: Am I doing something fundamentally wrong, or do I just have to keep annotating, e.g. by using ner.correct?

I really love working with Prodigy and would be extremely happy for any kind of advice.
Thanks in advance!

:sweat_smile: Love this, haven't heard that one before!

This is definitely interesting! When you trained from the binary annotations in step 3 and 4, did you make sure to set the --binary flag on prodigy train to make sure that they're interpreted as binary yes/no answers and the rejects are taken into account? Also, did you train on the total 2000 binary annotations (1200 + 800) or on the 1200 first and then on the 800? The latter could be problematic because if you're only updating with rare labels, the model might overfit on the new data.

Uhh, I didn't set the binary flag :grimacing:

I always trained with all annotations as I keep them in one dataset, so 2800 examples.
Would it be smarter to keep the binary annotations in a separate dataset and train the model separately - one run with the binary flag on and the other without? Taking the model that was previously trained on the 800 non-binary annotations and training it on the dataset with 2800 examples with the binary flag on didn't lead to a high accuracy (64 %).

I noticed that the metrics for the other, more common labels are now pretty low and the ones for the rare labels quite good. That could be a sign of overfitting, couldn't it? The marked labels in the screenshot are the rare ones.

My next step would simply be annotating more examples using ner.correct until the F-score is up again :upside_down_face:

Thank you so much for your help!
It's super interesting to get behind the logic.

Sorry for the confusion and the parallel workflows – it's certainly not ideal to have the two different types of training and it's something we're looking to unify in the future (and with the upgrade to spaCy v3).

Basically, the idea behind it is this: if you have annotations describing spans of text, there are two ways to update the named entity recognizer from them. One is to assume all unannotated tokens are not entities, and the other is assuming they're missing values. spaCy also allows updating the model with information like "I don't know what this token is but I know that it's not B-PERSON" (if you're interested in how this works in more detail, see my slides here). So this is what setting --binary does.
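To make the two interpretations a bit more concrete, here is a minimal sketch in plain Python (not spaCy's or Prodigy's actual code). The helper `spans_to_tags`, the example tokens and the `MATERIAL` label are made up for illustration, and `"-"` is used here simply as a placeholder for "unknown":

```python
def spans_to_tags(tokens, spans, missing="O"):
    """Turn token-index spans into BILUO-style tags.

    spans: list of (start_token, end_token_exclusive, label)
    missing: tag for tokens outside every span.
             "O" = "definitely not an entity",
             "-" = "unknown / missing value" (illustrative placeholder)
    """
    tags = [missing] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity
            tags[start] = f"U-{label}"
        else:
            # Multi-token entity: begin, inside..., last
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


# Hypothetical example text and label
tokens = ["Blech", "aus", "Edelstahl", "lasern"]
spans = [(2, 3, "MATERIAL")]

# Interpretation 1: everything unannotated is "not an entity"
complete = spans_to_tags(tokens, spans, missing="O")
# Interpretation 2: everything unannotated is a missing value
partial = spans_to_tags(tokens, spans, missing="-")

print(complete)
print(partial)
```

With the first interpretation, an accepted span from a binary session also tells the model that every other token is not an entity, which is usually not what a yes/no answer about one suggestion means.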

Yes, in general, I'd always recommend using more fine-grained datasets if possible. It's always easy to merge datasets later on (and Prodigy can take care of this for you when you train).

That's definitely interesting and worth keeping an eye on! How are you evaluating? Do you have a dedicated evaluation set? If not, that could be a good next step to work on: you want to make sure you're always evaluating on the same representative dataset so you can compare the results across runs, and properly compare the different approaches.
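As a rough sketch of what a dedicated evaluation set could look like in practice: split the examples once with a fixed seed and keep the held-back part in its own dataset. The function name `split_eval`, the 20 % fraction and the toy examples are assumptions for illustration, not Prodigy functionality:

```python
import random


def split_eval(examples, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and hold back a fixed share for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    # Return (training examples, evaluation examples)
    return shuffled[n_eval:], shuffled[:n_eval]


# Toy stand-ins for annotated examples
examples = [{"text": f"example {i}"} for i in range(10)]
train, dev = split_eval(examples)

print(len(train), len(dev))
```

Because the seed is fixed, the same examples end up in the evaluation part every time, so scores from different training runs stay comparable.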

Thank you for your extremely valuable input! Helped me a lot so far.

So far I always evaluated with the entire dataset and the default train-eval splits, but I definitely get the point of a dedicated evaluation set!

I think using ner.teach is not the ideal approach for the type of data I deal with - quite short and consistent technical texts describing a part and the manufacturing steps to build it.
By annotating most texts with ner.teach, I created a strong bias towards certain entities, which led to a dataset that isn't suitable for training a good model. At least that is how I think it might be :smile:

My approach is now ner.correct - as the F-score was very good even at 800 examples - until the train-curve is happy. Afterwards I will add some ner.teach training data in a separate dataset to check the influence.
Does that sound reasonable? :slight_smile:

Is there an elegant way to convert a CSV file back to JSONL? I got kinda stuck between pandas and srsly.
I exported the dataset using db-out and split off the 2000 ner.teach examples. I want to use the remaining 800 fully annotated examples and collect some more, so I have a decent training dataset that I can split the eval data off from.

Thank you so much in advance

Update: I created a separate dataset with 500 annotated examples for evaluation :slight_smile:

Yes, that sounds good to me :+1:

Apparently pandas now has a lines flag on the to_json method, so maybe that helps? I found it in this thread, which looks a lot like someone converting patterns for Prodigy :grinning_face_with_smiling_eyes:
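If pandas turns out to be awkward, here's a minimal sketch using only the standard library. The function name `csv_to_jsonl` and the `text`/`label` columns are made up for illustration, so adjust them to whatever your CSV actually contains:

```python
import csv
import io
import json


def csv_to_jsonl(csv_file, jsonl_file):
    """Write each CSV row as one JSON object per line and return the row count."""
    reader = csv.DictReader(csv_file)
    count = 0
    for row in reader:
        # ensure_ascii=False keeps umlauts readable in the output
        jsonl_file.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# In-memory stand-ins for real files; with files on disk you'd use
# open(path, newline="", encoding="utf8") for the CSV instead.
source = io.StringIO("text,label\nBlech lasern,STEP\n")
target = io.StringIO()
n = csv_to_jsonl(source, target)

print(target.getvalue())
```

One thing to watch out for: nested values like the spans end up as plain strings in a CSV, so they'd need to be parsed back (e.g. with json.loads, if they were written as JSON) before the result is usable as annotations again.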