Insult classifier tutorial - Zero insults shown in classification step

I have been using spaCy for a couple of years but am new to Prodigy, having purchased it yesterday.

I am following the insults classifier video. I have followed it precisely, with the only difference being that I used the en_core_web_lg model for vectors instead of en_vectors_web_lg. I am not sure whether the two contain the same vector data.

The first section, creating the extended list of insults, worked well. However, in the second section, where I classify whether a given sentence contains an insult, not a single sentence shown contained anything even closely resembling an insult. I also used the exact same Reddit download as in the video.

I am unsure what I have done wrong, but it would appear that something is not working. Any suggestions appreciated.

Hi! Could you share the command you're running and an excerpt from the patterns you created using the insults terminology dataset? One possible explanation is that the insults patterns aren't matched, so all you see are the model's suggestions, which are completely random (since the model hasn't really learned anything about insults yet).
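A rough way to check this outside of Prodigy is to count how many texts in a sample contain at least one pattern term. The sketch below is only an approximation: it assumes the patterns file holds single-token entries of the form {"label": "INSULT", "pattern": [{"lower": "idiot"}]}, it uses a crude regex word split rather than spaCy's tokenizer, and the helper names `load_pattern_terms` and `match_rate` are made up for illustration.

```python
import json
import re


def load_pattern_terms(lines):
    """Collect lowercase terms from single-token {"lower": ...} patterns.

    Assumption: each JSONL line has a "pattern" list of token dicts.
    Multi-token patterns and other attributes are skipped in this sketch.
    """
    terms = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        pattern = json.loads(line).get("pattern")
        if isinstance(pattern, list) and len(pattern) == 1:
            value = pattern[0].get("lower")
            if isinstance(value, str):
                terms.add(value)
    return terms


def match_rate(texts, terms):
    """Return the fraction of texts containing at least one term as a word."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = 0
    for text in texts:
        # Crude tokenization: lowercase runs of letters and apostrophes.
        words = set(re.findall(r"[a-z']+", text.lower()))
        if words & terms:
            hits += 1
    return hits / len(texts)


# Made-up sample data, standing in for the real patterns file and Reddit texts.
sample_patterns = [
    '{"label": "INSULT", "pattern": [{"lower": "idiot"}]}',
    '{"label": "INSULT", "pattern": [{"lower": "moron"}]}',
]
sample_texts = [
    "You are an idiot.",
    "Nice weather today.",
    "What a MORON!",
    "Thanks for the help.",
]

terms = load_pattern_terms(sample_patterns)
rate = match_rate(sample_texts, terms)
print(sorted(terms), rate)
```

To run it on real data, pass the lines of the patterns file (for example `open("insults_patterns2.jsonl", encoding="utf8")`) and a list of comment texts instead of the sample values. A rate near zero on a reasonably large sample would suggest the terms are simply rare in the data rather than that matching is broken.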

Hi Ines,

Sure, I have attached the insults file itself. I started everything again, and this time I received perhaps three sentences that contained insults, so it improved, but not by much. I also noticed that by around sentence 70, many of the sentences shown had a score of 0.01 or a similar value.

insults_patterns2.jsonl (2.7 KB)

In terms of the commands I used, they are as follows:

prodigy datasets insults_2 "Collect seed terms of insults classifier"
prodigy terms.teach insults_2 en_vectors_web_lg --seeds insults.txt (I redid this with the vectors model; insults.txt contained the same insults as in your video)
prodigy dataset insults "classify insults"
prodigy insults_2 ./insults_patterns.jsonl --label INSULTS --spacy-model en_core_web_lg (I do not have the vectors attached to en_core_web_sm, so I used en_core_web_lg "out of the box")
prodigy textcat.teach insults3 en_core_web_lg RC_2010-12.bz2 --loader reddit --label INSULT --patterns insults_patterns2.jsonl

Is that what you were looking for?

Thanks in advance!

Thanks for the update! Your workflow looks correct, but I think I might have found a (very subtle!) problem:

If your patterns were created with the label INSULTS, but the textcat.teach recipe specifies the label INSULT, you will not see any suggestions from the patterns (because Prodigy will only look for patterns for the label INSULT, and none of your patterns carry it). This would also explain the behaviour you're seeing: all suggestions you get are suggestions from the model, which are pretty random, so it's not actually learning much as you annotate. Pattern matches will show the pattern number (line number) in the bottom left of the annotation card, and also highlight the matched word (a new feature we added after the video was created, to make it more obvious where the match comes from).
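A quick way to rule this out is to count the labels in the patterns file and compare them with the label passed to the recipe. This is a minimal sketch, assuming each line of the file is a JSON object with a "label" key; the function name `pattern_label_counts` and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def pattern_label_counts(lines):
    """Count how many patterns exist per label in patterns JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line).get("label")] += 1
    return counts


# Made-up sample lines with the plural label, standing in for the real file.
sample_patterns = [
    '{"label": "INSULTS", "pattern": [{"lower": "idiot"}]}',
    '{"label": "INSULTS", "pattern": [{"lower": "moron"}]}',
]
recipe_label = "INSULT"  # the value passed to --label in textcat.teach

counts = pattern_label_counts(sample_patterns)
usable = counts.get(recipe_label, 0)
print(f"labels in file: {dict(counts)}")
print(f"patterns usable for {recipe_label!r}: {usable}")
```

With the sample above, no pattern is usable for INSULT because every entry is labelled INSULTS. For the real file, pass `open("insults_patterns2.jsonl", encoding="utf8")` as `lines`.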

Thanks once again for your reply. I realise that I did actually type INSULT in the singular in all instances on my machine, and only made the error in the message above. In other words, the commands I actually ran did not contain that mistake.

A seed term appeared in the annotation card probably only around 1% of the time; otherwise the text shown contained no seed term at all. I just started the whole process again to make sure there was no error like the one above, and unfortunately the results are the same.

In case it is of any use:

I was experimenting with making a dentistry list this evening. If I run the resulting seed list through the ner.teach recipe:

prodigy ner.teach dentistry_ner en_core_web_lg RC_2015-03.bz2 --loader reddit --label DENTISTRY --patterns dentistry_patterns.jsonl

it provides example sentences that use the seed words found in the dentistry patterns file (attached: dentistry_patterns.jsonl (4.3 KB)).

However, when I run the same dentistry patterns file through the text classifier:

prodigy textcat.teach dentistry en_core_web_lg RC_2015-03.bz2 --loader reddit --label DENTISTRY --patterns dentistry_patterns.jsonl

the example sentences never contain any of the terms in the dentistry patterns file, or anything remotely related.

I have uploaded example screenshots from ner.teach and textcat.teach.