mutually exclusive classes and textcat.batch-train

sveda · June 25, 2019, 11:06am

Hi There,

I want to use Prodigy (or spaCy if it’s easier) to train a classifier on a bunch of annotations (10k) with 10 mutually exclusive classes.
When I import my .jsonl file into a Prodigy dataset, I see that answer=‘accept’ is added to all entries. And when training with batch-train I get 100% accuracy in the first iteration.

I’ve found that training works if I iterate over my dataset and add 9 reject entries (one for each of the other classes) for every accept entry, but this is tedious, and training takes much longer with 100k entries instead of 10k. Is this how you are supposed to do it, or is there a better way?

seb · June 25, 2019, 12:52pm

Did you use the --exclusive flag in the textcat.batch-train recipe?

But if you only have “accept” and no “reject” answers I think your results make sense. You could use dummy data for the reject cases.

sveda · June 26, 2019, 9:26am

Thanks for the suggestion. I ran textcat.batch-train with the --exclusive flag, but the results were terrible:
accept accept 413
accept reject 0
reject reject 0
reject accept 1643

Correct 413
Incorrect 1643

Baseline 0.09
Precision 1.00
Recall 0.20
F-score 0.33
Accuracy 0.20

The 1643 rejects make no sense to me. I would like the classifier to always give me an accept prediction for exactly one of these 10 classes, the one with the highest score. Is this possible?
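To make it concrete, this is roughly the behaviour I’m after: a minimal sketch that assumes the model gives me a dict of label -> score (the scores and the extra label names below are made up for illustration, not real output).

```python
def top_label(cats):
    """Return the label with the highest score from a dict of label -> score."""
    return max(cats, key=cats.get)


# Made-up scores, for illustration only
scores = {"15900": 0.41, "15901": 0.35, "15902": 0.24}
print(top_label(scores))
```

So instead of an accept/reject decision per label, I’d just take the single best-scoring class for each text.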

honnibal · June 26, 2019, 2:13pm

Which version of Prodigy are you using? Also, can you print one example record, after the convert_options_to_cats() function has run in the textcat.batch-train recipe?

What we want is a cats object that’s a dict, keyed by your different labels. We want to see one entry with 1.0 as its value, and all the other labels with 0.0 as their value. spaCy’s textcat accepts this dict format so that we can deal with missing values, or cases where multiple labels might be true.
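As a rough way to check a record against that description, something like the sketch below could help. The helper name is_one_hot and the label list are hypothetical, made up for this example; they’re not part of Prodigy or spaCy.

```python
def is_one_hot(cats, labels):
    """Check that cats has every label, with exactly one 1.0 and the rest 0.0."""
    if set(cats) != set(labels):
        return False
    values = list(cats.values())
    return values.count(1.0) == 1 and all(v in (0.0, 1.0) for v in values)


# Hypothetical label set, for illustration only
labels = ["A", "B", "C"]
print(is_one_hot({"A": 1.0, "B": 0.0, "C": 0.0}, labels))  # complete record
print(is_one_hot({"A": 1.0}, labels))                      # other labels missing
```

A record that only contains the accepted label would fail this check, which is the kind of thing to look for in the printed example.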

You do want the --exclusive flag. It just sounds like something’s going wrong with the data processing. Some early versions of Prodigy v1.8 had bugs around the --exclusive argument of the textcat.batch-train recipe, so I think that might be what’s happening here.

sveda · June 27, 2019, 8:30am

You are right: both with and without the --exclusive flag, there is only one category in the cats dict.
I thought this was fixed in v1.8.3 though? (I’m using version 1.8.3.)
Here is the output with the --exclusive flag:

Loaded blank model
{‘label’: ‘15900’, ‘text’: ‘Tanken är att försöken ska inledas redan nästa år. Geofencing är en teknisk lösning som innebär att endast tillåtna fordon kan köra inom ett geografiskt område. Tekniken kan exempelvis användas för att begränsa hastigheten, skriver regeringen i ett pressmeddelande. - Det känns som ett lovande initiativ, säger Anna Johansson vid en presskonferens. Efter terrordådet med en lastbil på Drottninggatan i Stockholm i april bjöd Anna Johansson in branschaktörerna för att diskutera vad man kan göra för att förhindra att det upprepas. Nu har ett uppföljningsmöte hållits och vid det enades flera parter om att försöka ta fram digitala lösningar. Var försöken med geofencing ska hållas är inte klart ännu. Göteborgs stad och Stockholms stad ska tillsammans med Volvo Cars, Volvokoncernen, Scania och Trafikverket ingå i en gemensam styrgrupp. Myndigheten Trafikanalys har också fått i uppdrag av regeringen att bland annat föreslå åtgärder för hur tunga fordon ska användas säkert i stadsmiljöer och undersöka vilka alternativ som finns för att transportera gods säkert och miljöanpassat. FAKTA Bakgrund: Terrordådet på Drottninggatan På eftermiddagen fredagen 7 april i år kom ett larm om att en lastbil kört på människor på Drottninggatan i centrala Stockholm och därefter kraschat in i varuhuset Åhléns. Det visade sig att en man stulit lastbilen och sedan kört längs Drottninggatan. Fem människor dödades i dådet. Tre avled på Drottninggatan och två på sjukhus. Den misstänkte gärningsmannen, Rakhmat Akilov från Uzbekistan, greps några timmar efter dådet. Han är nu häktad misstänkt för terroristbrott. Vid häktningsförhandlingen erkände han, via sin advokat. Lastbilar som vapen har använts vid flera tillfällen, till exempel: I januari i år dödades fyra soldater när en lastbil körde in i en folksamling i Jerusalem. I december i fjol miste tolv personer livet när en lastbil körde in bland människorna på en julmarknad i Tysklands huvudstad Berlin. 
Den 14 juli, Frankrikes nationaldag, förra året dödades 84 människor när en lastbil körde in i en folkmassa på strandpromenaden i Nice i Frankrike. Läs mer TT.’, ‘_task_hash’: 2039717295, ‘answer’: ‘accept’, ‘cats’: {‘15900’: 1.0}, ‘_input_hash’: -549308992}

honnibal · July 1, 2019, 8:54pm

Yes, I thought that was fixed too. Hmm. I’ll look into how that could have happened. In the meantime, could you fix the cats dict yourself to work around the problem?
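A minimal sketch of that workaround, assuming each record is an accepted example with a "label" key like the one you printed. The function name fill_cats is made up for illustration and isn’t a Prodigy or spaCy API, and the sample records are shortened placeholders.

```python
def fill_cats(examples, labels=None):
    """Return copies of the examples with a complete cats dict:
    1.0 for the record's own label, 0.0 for every other label."""
    if labels is None:
        # Assumption: every class appears at least once in the data
        labels = sorted({eg["label"] for eg in examples})
    fixed = []
    for eg in examples:
        eg = dict(eg)  # shallow copy so the input list is left untouched
        eg["cats"] = {label: 1.0 if label == eg["label"] else 0.0
                      for label in labels}
        fixed.append(eg)
    return fixed


# Placeholder records, for illustration only
examples = [
    {"text": "first text", "label": "15900", "answer": "accept"},
    {"text": "second text", "label": "15901", "answer": "accept"},
]
for eg in fill_cats(examples):
    print(eg["label"], eg["cats"])
```

This keeps the dataset at one record per text, rather than adding a separate reject entry for each of the other classes.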
