Annotation score drops

hi @Mohammad!

Thanks for your question.

Unfortunately, it's really hard to say without a lot more context, for example:

  • what's your total sample size? Adding 20% to 100 annotations won't do much, while adding 20% to 10,000 annotations should give more robust results. From a glance, it looks like you may have a good number of annotations, so this may not be the sole reason.
  • were your new annotations inconsistent? (e.g., did you add new annotators, and if so, did you give them clear annotation guidelines, or could they have introduced noise into their annotations?)
  • relatedly, how good is the quality of all of your annotations? Did you do any "gold" / second-round reviews of your initial annotations? (This is what the spans.correct recipe is for; see the sketch after this list.)
  • I see you labeled with spans.manual yet are training with --ner -- why is that? Be careful: if you add overlapping spans with spans.manual, you won't be able to train with --ner. This may not be the cause, but it's worth knowing to avoid problems in your workflow.
  • did you add any new labels? Did you have any labels with very few records? (e.g., say you had a span type with only 5 examples in your original run and got 100% F1, but after adding 5 more it dropped to 50% simply due to the tiny sample size)
  • have you run any span characteristics diagnostics? For example, see spacy debug data, including the note on "span characteristics". Overly long spans and/or ambiguous span boundaries can hurt performance.
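
Since I mentioned spans.correct above, here's a rough sketch of what a second-round review could look like. All dataset, label, and path names are placeholders for your own, and note that spans.correct needs a model with a trained spancat component, so this trains one just for the review step:

```
# Placeholder names throughout -- adjust datasets, labels, and paths for your project.
# 1) Train a temporary spancat model on the first-round annotations.
prodigy train ./tmp_model --spancat my_spans_dataset

# 2) Review and correct the model's suggestions into a fresh "gold" dataset.
prodigy spans.correct my_spans_gold ./tmp_model/model-best ./raw_texts.jsonl --label LABEL_A,LABEL_B
```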

The biggest problem I see is likely that you're not using a dedicated evaluation set. We mention this in the prodigy train docs:

For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.

If you don't specify a dedicated evaluation dataset, each time you rerun prodigy train it'll reshuffle the data and hold out a different evaluation split. This can make it look like performance changed, when really you've changed what you're evaluating on.
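
Concretely, that could look something like this, where the dataset names are just placeholders for your own:

```
# "my_spans" holds the training annotations; "my_spans_eval" is a separate
# dataset annotated only for evaluation, so the held-out data never changes.
prodigy train ./output --ner my_spans,eval:my_spans_eval
```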

Here's a post where I go through some of that workflow:

Alternatively, you may want to consider switching to spacy train instead of prodigy train via the data-to-spacy recipe. This has many advantages: it creates dedicated evaluation sets (as spaCy binary files) and a config that you can modify if you want to do hyperparameter tuning down the road:
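
For example, something along these lines (output folder and dataset names are placeholders):

```
# Export the Prodigy annotations to spaCy's binary format plus a training config.
prodigy data-to-spacy ./corpus --ner my_spans,eval:my_spans_eval

# Train with spaCy directly, reusing the generated config and .spacy files.
python -m spacy train ./corpus/config.cfg --output ./model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```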

Have you tried train-curve? This recipe was developed to help debug the question of when to stop annotating. It too works best with a dedicated evaluation dataset.
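
A minimal sketch, again with placeholder dataset names:

```
# Trains on increasing portions of "my_spans" and reports the score at each
# step, so you can see whether more annotations are still improving the model.
prodigy train-curve --ner my_spans,eval:my_spans_eval
```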

As I also mentioned, if you run data-to-spacy, you can then run spacy debug data on the exported corpus to get the span characteristics. If it helps, the spaCy team has a great template project on evaluating spancat:
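
For reference, a sketch of what that could look like once data-to-spacy has written a ./corpus folder (paths are placeholders):

```
# Runs spaCy's data diagnostics on the exported corpus; for a spancat pipeline
# this includes the "span characteristics" section (span lengths, boundary tokens, etc.).
python -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --verbose
```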