Extreme newbie here. I'm working on a classifier to take auction results and extract details from them, and Prodigy has been working excellently.
My workflow was to start out with a patterns file that I generated from a dataset I had, and then do some manual annotation with:
prodigy ner.manual whisky_ner blank:en /datasets/auctions.jsonl --patterns /datasets/patterns.jsonl --label BRAND,DISTILLERY,BOTTLER,STATED_AGE,VINTAGE,BOTTLED,CASK_NUMBER,STRENGTH,VOLUME
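For reference, the patterns file is just spaCy-style match patterns, one JSON object per line, which I generated from a list of known terms roughly like this (the terms here are only placeholders):

import json

known_terms = {
    "BRAND": ["Macallan", "Ardbeg"],            # placeholder brand names
    "DISTILLERY": ["Glenfarclas", "Caol Ila"],  # placeholder distillery names
}

with open("/datasets/patterns.jsonl", "w") as f:
    for label, terms in known_terms.items():
        for term in terms:
            # one token-based pattern per term, e.g. {"label": "BRAND", "pattern": [{"lower": "macallan"}]}
            pattern = [{"lower": tok.lower()} for tok in term.split()]
            f.write(json.dumps({"label": label, "pattern": pattern}) + "\n")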
I hand classified about 500 items and then ran training:
prodigy train ner whisky_ner en_vectors_web_lg --output whisky_model
After that I did some further manual correction using ner.correct:
prodigy ner.correct whisky_ner_correct /models/whisky_model -U --exclude whisky_ner
That worked pretty well; after 500-1000 annotations I have reasonable results:
Label Precision Recall F-Score
----------- --------- ------ -------
STRENGTH 95.111 94.690 94.900
DISTILLERY 92.727 97.143 94.884
VOLUME 75.824 76.667 76.243
BOTTLED 96.040 94.175 95.098
VINTAGE 90.678 94.690 92.641
BOTTLER 81.250 79.592 80.412
STATED_AGE 98.726 98.726 98.726
CASK_NUMBER 90.566 87.273 88.889
BRAND 47.619 33.333 39.216
It seems obvious I need to put some extra effort into refining the precision on VOLUME and BRAND.
My first attempt at this was to follow the approaches in several of the video tutorials, one of which suggested using ner.teach with a single label, which I did like this:
prodigy ner.teach whisky_ner_binary /models/whisky_model /datasets/auctions.jsonl --exclude whisky_ner --label BRAND -U
I ran through about 1000 examples there until the percentage shown was above 90%. After that I tried a few things, none of which resulted in a better model:
- I tried applying the binary annotations directly to the model with
prodigy train ner whisky_ner_binary /models/whisky_model --output /models/whisky_model_post_binary --binary
but the result seemed radically worse (I assume lots of annotations with only the BRAND label derailed the model's earlier learning).
- I messed around with ner.silver-to-gold to create a new dataset, which I then applied to the model. I wasn't sure how much of the manual tagging I needed to do on that dataset; it was time consuming since there weren't any of the other annotations to rely on. The result when trained into the model didn't seem any better.
Am I going about this wrong? Any suggestions?