Problems integrating binary data into my model

Extreme newbie here. I'm working on a classifier to take auction results and extract details from them, and Prodigy has been working excellently.

My workflow was to start out with a patterns file that I generated from a dataset I had, and then do some manual annotation with:

prodigy ner.manual whisky_ner blank:en /datasets/auctions.jsonl --patterns /datasets/patterns.jsonl --label BRAND,DISTILLERY,BOTTLER,STATED_AGE,VINTAGE,BOTTLED,CASK_NUMBER,STRENGTH,VOLUME

I hand-annotated about 500 items and then ran training:

prodigy train ner whisky_ner en_vectors_web_lg --output whisky_model

After that I did some further manual correction using ner.correct:

prodigy ner.correct whisky_ner_correct /models/whisky_model -U --exclude whisky_ner

That worked pretty well; after 500-1000 annotations I have reasonable results:

Label         Precision   Recall   F-Score
-----------   ---------   ------   -------
STRENGTH         95.111   94.690    94.900
DISTILLERY       92.727   97.143    94.884
VOLUME           75.824   76.667    76.243
BOTTLED          96.040   94.175    95.098
VINTAGE          90.678   94.690    92.641
BOTTLER          81.250   79.592    80.412
STATED_AGE       98.726   98.726    98.726
CASK_NUMBER      90.566   87.273    88.889
BRAND            47.619   33.333    39.216

It seems obvious I need to put some extra effort into improving the results for VOLUME and BRAND.

My first attempt at this was to follow approaches in several of the video tutorials, one of which suggested using ner.teach with a single label, which I did like this:

prodigy ner.teach whisky_ner_binary /models/whisky_model /datasets/auctions.jsonl --exclude whisky_ner --label BRAND -U

I ran through about 1000 examples there until the percentage shown was above 90%. After that I tried a few things, none of which resulted in a better model:

  1. I tried applying the binary annotations directly to the model with prodigy train ner whisky_ner_binary /models/whisky_model --output /models/whisky_model_post_binary --binary, but the result seemed radically worse (I assume lots of annotations with only the BRAND label derailed what the model had learned earlier).

  2. I messed around with ner.silver-to-gold to create a new dataset, which I then applied to the model. I wasn't sure how much of the manual tagging I needed to do on that dataset; it was time-consuming because there weren't any of the other annotations to rely on. The result, when trained into the model, didn't seem any better.

Am I going about this wrong? Any suggestions?

Hi! I think it's definitely possible that this is what's going on, and that your model ends up overfitting on the new examples. One thing you could try here is including the previously created manual annotations, to remind the model of the other labels and correct predictions.

In Prodigy v1.11 (currently available as a nightly pre-release), the train workflow supports training from both manual and binary annotations together, which can definitely help with this as well. That way you can take advantage of the complete manual annotations, where unannotated tokens are treated as "not an entity", while also including the sparse binary feedback from your ner.teach annotations.
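For example, once you have the nightly installed, training from both datasets together should look roughly like this (dataset names are just the ones from your commands above, so double-check the nightly docs for the exact arguments):

prodigy train ./whisky_model_v2 --ner whisky_ner,whisky_ner_binary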

It could also be worth exploring what exactly is different about those two labels. One thing to look at is obviously the frequencies – if it turns out that you only have very few examples of those entities, that's definitely not ideal, and it makes sense to add more examples that include those labels.
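A quick way to check is to load your annotations from the database and count the spans per label. Here's a minimal sketch in Python, assuming the dataset names from your commands above:

from collections import Counter
from prodigy.components.db import connect

db = connect()  # connects to the default Prodigy database
counts = Counter()
for eg in db.get_dataset("whisky_ner"):
    for span in eg.get("spans", []):
        counts[span["label"]] += 1

print(counts.most_common())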

But it might also be helpful to perform some more in-depth error analysis to find out if there's a more general problem you can pinpoint, or a common error pattern. You could even do this by writing a small Prodigy recipe that streams in your evaluation data, processes the examples with your trained model and filters for all examples where the result for BRAND is different from the correct annotation in the evaluation data. You could then add some options that let you annotate what the problem was, e.g. wrong label, incomplete span etc. (I'm doing something pretty similar in the second error analysis recipe in this video tutorial).
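Here's a rough outline of what such a recipe could look like – it's only a sketch under some assumptions: /datasets/auctions_eval.jsonl is a hypothetical export of your evaluation examples (with "text" and "spans"), and the recipe name and option labels are placeholders to adapt:

import spacy
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("brand-errors")
def brand_errors(dataset, model_path, source):
    nlp = spacy.load(model_path)

    def get_stream():
        for eg in JSONL(source):
            doc = nlp(eg["text"])
            predicted = {(e.start_char, e.end_char) for e in doc.ents if e.label_ == "BRAND"}
            gold = {(s["start"], s["end"]) for s in eg.get("spans", []) if s["label"] == "BRAND"}
            if predicted != gold:  # keep only examples where the model disagrees on BRAND
                eg["options"] = [
                    {"id": "wrong_label", "text": "Wrong label"},
                    {"id": "incomplete_span", "text": "Incomplete span"},
                    {"id": "missed", "text": "Missed entity"},
                    {"id": "other", "text": "Other"},
                ]
                yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "choice",
        "config": {"choice_style": "multiple"},
    }

You could then run it with prodigy brand-errors brand_errors_dataset /models/whisky_model /datasets/auctions_eval.jsonl -F recipe.py and annotate the error types as multiple-choice options.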

Maybe it turns out that the model mostly struggles with the distinction between DISTILLERY, BOTTLER and BRAND, if those end up looking fairly similar. In that case, you could experiment with an approach that combines these categories and then uses additional logic for the final distinction. Or maybe you'll find that your evaluation data isn't actually representative and ended up with a disproportionate amount of ambiguous or "weird" examples for a particular label. This can happen with a random split, especially if your dataset is fairly small. With a small evaluation set, even 5 "weird" or messy examples can easily cost you 5% in accuracy.

P.S.: This is a fun use case! I was never much of a whisky drinker but I kinda got into it during the lockdowns :tumbler_glass:

Thanks for the detailed answer @ines! I was also thinking about going down the path of merging DISTILLERY, BOTTLER and BRAND.

I'll watch the video you suggested, and I've also signed up for the nightly program.

Cheers! :tumbler_glass:


Cool, let us know how you go! :tumbler_glass:

You could also consider adding an EntityRuler or Matcher-based component that handles the most common brands and distilleries. What you can take advantage of here is that a) popular brands are likely to also be most common in your data because they're talked about more and b) whisky is something people take quite seriously, it's regulated and probably well-catalogued. So maybe there's even an existing database you can query for this. Your model will obviously still have to deal with unknown brands, spelling variations and edge cases, but there might be a lot of low-hanging fruit that you can cover with a simple dictionary approach to boost accuracy.
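Here's a small sketch of how that could look with the EntityRuler, assuming spaCy v2 (which matches the en_vectors_web_lg base model you're training from) and a hypothetical list of known brand names you'd pull from a database or your own data:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("/models/whisky_model")

# hypothetical dictionary – in practice you'd load this from a file or database
known_brands = ["Ardbeg", "Lagavulin", "Glenfiddich"]
patterns = [{"label": "BRAND", "pattern": name} for name in known_brands]

# overwrite_ents=True lets dictionary matches win over the model's predictions
ruler = EntityRuler(nlp, overwrite_ents=True)
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, after="ner")

doc = nlp("Ardbeg 10 Year Old, 46%, 70cl")
print([(ent.text, ent.label_) for ent in doc.ents])

The model still handles anything the dictionary doesn't know about, so the two approaches complement each other.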