Hi everyone, I'm just getting started with Prodigy, having moved over from AutoML. Loving Prodigy / spaCy so far. I was wondering whether the behaviour described below is expected.
We've been training a custom model using the following:
prodigy textcat.manual business_model ./BusinessData.txt --label account,contact_info,online,other,sales,service,tracking --exclusive
Our BusinessData.txt file contains verbatim utterances from our IVR; some examples below:
I would like to link my account
the amount that was paid into the account
I would like to speak to the credit manager
I would like to track the status of my shipment
So far we've been using the manual recipe to assign one exclusive label.
After classifying probably a thousand records, we trained a model:
prodigy train textcat business_model blank:en --output ./models/business_model
We then load this model with spacy.load in a Python shell.
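For reference, this is roughly how we're loading and querying it (path taken from the --output above):

import spacy

# load the model exported by prodigy train
nlp = spacy.load("./models/business_model")

# run a phrase through the text classifier and inspect the scores
doc = nlp("I would like to make a complaint about a delivery driver")
print(doc.cats)  # dict of label -> score, as shown further down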
What we've observed is that the model is biased towards the more frequently used labels. In some cases this happens even when we feed it the exact phrase that was annotated with a given category, e.g. this annotation:
{"text":"I would like to make a complaint about a delivery driver","_input_hash":442155054,"_task_hash":-1466638285,"options":[{"id":"account","text":"account"},{"id":"contact_info","text":"contact_info"},{"id":"online","text":"online"},{"id":"other","text":"other"},{"id":"sales","text":"sales"},{"id":"service","text":"service"},{"id":"tracking","text":"tracking"}],"_session_id":null,"_view_id":"choice","accept":["other"],"answer":"accept"}
comes back wrongly classified:
{
"service": 0.07783517241477966,
"online": 0.0007357532740570605,
"account": 0.0011064341524615884,
"tracking": 0.6002848148345947,
"other": 0.017688263207674026,
"sales": 0.07158508896827698,
"contact_info": 0.0007702212897129357
}
Below is the output from our training run:
Label ROC AUC
service 0.817
online 0.445
account 0.947
tracking 0.914
other 0.788
sales 0.882
contact_info 0.234
Best ROC AUC 0.718
Baseline 0.421
Can I ask: is there something we've done wrong, or does the model just require further training? We've run the train-curve command and it does suggest more annotations would help, but the accuracy only improves by 0.01 in the last segment, and in some runs it actually decreases.
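For reference, we ran train-curve mirroring the train command above, roughly like this:

prodigy train-curve textcat business_model blank:en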
If we haven't done anything wrong, what would you suggest at this point? textcat.teach perhaps? And providing more training examples for the labels that are used less frequently?
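If textcat.teach is the way to go, we're assuming we'd run it against the trained model, something along these lines (just a guess on our part):

prodigy textcat.teach business_model ./models/business_model ./BusinessData.txt --label account,contact_info,online,other,sales,service,tracking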
Also, as a side note: is there a command we can run to see how many examples of each label exist in a dataset? At present we're doing a db-out and working this out from the JSONL file (rough sketch of our current approach below).
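For context, this is roughly what we're doing at the moment (the export file name is just an example):

import json
from collections import Counter

# exported first with: prodigy db-out business_model > business_model.jsonl
counts = Counter()
with open("business_model.jsonl", encoding="utf8") as f:
    for line in f:
        example = json.loads(line)
        if example.get("answer") == "accept":
            counts.update(example.get("accept", []))

print(counts)  # label -> number of accepted examples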
Thanks very much!