Hi! I think it's definitely possible that this is what's going on, and that your model ends up overfitting on the new examples. One thing you could try here is including the previously created manual annotations, to remind the model of the other labels and correct predictions.
In Prodigy v1.11 (currently available as a nightly pre-release), the `train` workflow supports training from both manual and binary annotations together, which can definitely help with this as well. You can then take advantage of the complete annotations and consider unannotated tokens here as "not an entity", while also including sparse binary feedback from the binary annotations.
It could also be worth exploring what exactly is different about those two labels. One thing to look at is obviously the frequencies: if it turns out that you only have very few examples of those entities, then that's definitely not ideal, and it makes sense to add more examples that include those labels.
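To check the frequencies, here's a minimal sketch that counts span labels. It assumes your annotations are dictionaries with a `"spans"` list and an optional `"answer"` key, which is how exported Prodigy examples typically look. The helper name `count_span_labels` and the toy data are made up for illustration, so swap in your own exported examples.

```python
from collections import Counter


def count_span_labels(examples):
    """Count how often each label occurs across the spans of accepted examples."""
    counts = Counter()
    for eg in examples:
        # Skip rejected or ignored examples; treat a missing answer as accepted
        if eg.get("answer", "accept") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Hypothetical toy data standing in for your exported annotations
examples = [
    {
        "text": "Ardbeg 10 bottled by Signatory",
        "spans": [
            {"start": 0, "end": 6, "label": "DISTILLERY"},
            {"start": 21, "end": 30, "label": "BOTTLER"},
        ],
        "answer": "accept",
    },
    {
        "text": "Lagavulin 16",
        "spans": [{"start": 0, "end": 9, "label": "DISTILLERY"}],
        "answer": "accept",
    },
    {
        "text": "Johnnie Walker Black",
        "spans": [{"start": 0, "end": 14, "label": "BRAND"}],
        "answer": "reject",
    },
]

label_counts = count_span_labels(examples)
for label, n in label_counts.most_common():
    print(label, n)
```

If one label shows up far less often than the others, that's a good hint for where to collect more examples first.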
But it might also be helpful to perform some more in-depth error analysis to find out if there's a more general problem you can pinpoint, or a common error pattern. You could even do this by writing a small Prodigy recipe that streams in your evaluation data, processes the examples with your trained model and filters for all examples where the result for `BRAND` is different from the correct annotation in the evaluation data. You could then add some options that let you annotate what the problem was, e.g. wrong label, incomplete span etc. (I'm doing something pretty similar in the second error analysis recipe in this video tutorial).
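Here's a rough sketch of the filtering part of that idea in plain Python, so it runs without Prodigy installed. The names `filter_label_errors`, `spans_for_label`, `ERROR_OPTIONS` and `fake_predict` are all made up for illustration, and `predict` is assumed to be any function that takes a text and returns a list of span dicts. With spaCy, you could build it from `nlp(text).ents` using `ent.start_char`, `ent.end_char` and `ent.label_`. In a real recipe, the generator would become your stream, and the `"options"` should work with the `choice` interface, but treat that part as an assumption to verify against the docs.

```python
def spans_for_label(spans, label):
    """Return the set of (start, end) offsets of all spans with the given label."""
    return {(s["start"], s["end"]) for s in spans if s["label"] == label}


# Hypothetical options describing what went wrong
ERROR_OPTIONS = [
    {"id": "wrong_label", "text": "Wrong label"},
    {"id": "incomplete_span", "text": "Incomplete span"},
    {"id": "missed", "text": "Entity missed"},
    {"id": "other", "text": "Other"},
]


def filter_label_errors(eval_examples, predict, label):
    """Yield only the examples where predictions for `label` differ from gold."""
    for eg in eval_examples:
        gold_spans = eg.get("spans", [])
        predicted_spans = predict(eg["text"])
        if spans_for_label(gold_spans, label) != spans_for_label(predicted_spans, label):
            yield {
                "text": eg["text"],
                "spans": predicted_spans,  # what the model predicted
                "gold_spans": gold_spans,  # correct annotation, kept for reference
                "options": ERROR_OPTIONS,
            }


# Hypothetical evaluation data and a stand-in for the trained model
eval_examples = [
    {"text": "Johnnie Walker Black", "spans": [{"start": 0, "end": 14, "label": "BRAND"}]},
    {"text": "Lagavulin 16", "spans": [{"start": 0, "end": 9, "label": "DISTILLERY"}]},
]
fake_predictions = {
    "Johnnie Walker Black": [{"start": 0, "end": 7, "label": "BRAND"}],  # incomplete span
    "Lagavulin 16": [{"start": 0, "end": 9, "label": "DISTILLERY"}],  # correct
}


def fake_predict(text):
    return fake_predictions[text]


errors = list(filter_label_errors(eval_examples, fake_predict, "BRAND"))
print(len(errors), "example(s) with BRAND errors")
```

Comparing exact character offsets is strict on purpose: an incomplete span counts as an error, which is exactly the kind of case you'd want to look at.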
Maybe it turns out that the model mostly struggles with the distinction between `DISTILLERY`, `BOTTLER` and `BRAND`, if those end up looking fairly similar. In that case, you could experiment with an approach that combines these categories and then uses additional logic for the final distinction. Or maybe you'll find that your evaluation data isn't actually representative and ended up with a disproportionate amount of ambiguous or "weird" examples for a particular label. This can happen with a random split, especially if your dataset is fairly small. With a small evaluation set (say, 100 examples), even 5 "weird" or messy examples can easily cost you 5% in accuracy.
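If you want to try the combined-category approach, a minimal sketch of the relabelling step could look like this. The merged label name `PRODUCER`, the `orig_label` key and the `merge_labels` helper are all made up for illustration. The idea is just to train the entity recognizer on the merged label and keep the original label around for whatever second step (rules, a text classifier etc.) makes the final distinction.

```python
MERGED_LABEL = "PRODUCER"  # hypothetical name for the combined category
TO_MERGE = {"DISTILLERY", "BOTTLER", "BRAND"}


def merge_labels(examples, to_merge=TO_MERGE, merged_label=MERGED_LABEL):
    """Return copies of the examples with the similar labels collapsed into one."""
    merged = []
    for eg in examples:
        new_spans = []
        for span in eg.get("spans", []):
            new_span = dict(span)  # copy so the original annotations stay untouched
            if span["label"] in to_merge:
                new_span["orig_label"] = span["label"]  # keep for the second step
                new_span["label"] = merged_label
            new_spans.append(new_span)
        merged.append({**eg, "spans": new_spans})
    return merged


# Hypothetical toy data; REGION is an invented label that should stay as-is
annotated = [
    {
        "text": "Ardbeg 10 bottled by Signatory",
        "spans": [
            {"start": 0, "end": 6, "label": "DISTILLERY"},
            {"start": 21, "end": 30, "label": "BOTTLER"},
        ],
    },
    {"text": "Islay single malt", "spans": [{"start": 0, "end": 5, "label": "REGION"}]},
]

merged_examples = merge_labels(annotated)
for eg in merged_examples:
    print(eg["text"], [(s["label"], s.get("orig_label")) for s in eg["spans"]])
```

If accuracy on the merged label is much better than on the three separate ones, that's a good sign the boundaries are fine and the problem really is the distinction between the categories.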
P.S.: This is a fun use case! I was never much of a whisky drinker but I kinda got into it during the lockdowns