Insult classifier tutorial - Zero insults shown in classification step

I have been using spaCy for a couple of years but am new to Prodigy, having purchased it yesterday.

I am following the insults classifier video. I have followed it precisely, with the only difference being that I used the en_core_web_lg model for vectors instead of en_vectors_web_lg. I am not sure whether they contain the same vector data.

The first section, creating the extended list of insults, worked well. However, in the second section, where I classify whether a given sentence contains an insult, not a single sentence shown contained anything even closely resembling an insult. I also used the exact same Reddit download as in the video.

I am unsure what I have done wrong, but it would appear that something is not working. Any suggestions appreciated.

Hi! Could you share the command you're running and an excerpt from the patterns you created using the insults terminology dataset? One possible explanation is that the insults patterns aren't matched, so all you see are the model's suggestions, which are completely random (since the model hasn't really learned anything about insults yet).

Hi Ines,

Sure, I have attached the insults file itself. I started everything again, and this time I received perhaps three sentences that contained insults, so an improvement, but not by much. Also, I noticed that by around sentence 70, many of the sentences shown had a score of 0.01, or a value close to that.

insults_patterns2.jsonl (2.7 KB)

In terms of the commands I used, they are as follows:

prodigy dataset insults_2 "Collect seed terms of insults classifier"
prodigy terms.teach insults_2 en_vectors_web_lg --seeds insults.txt (I redid it with the vectors model; insults.txt contained the same insults as in your video)
prodigy dataset insults "classify insults"
prodigy terms.to-patterns insults_2 ./insults_patterns.jsonl --label INSULTS --spacy-model en_core_web_lg (I do not have vectors attached to en_core_web_sm, so used en_core_web_lg "out of the box")
prodigy textcat.teach insults3 en_core_web_lg RC_2010-12.bz2 --loader reddit --label INSULT --patterns insults_patterns2.jsonl

Is that what you were looking for?

Thanks in advance!

Thanks for the update! Your workflow looks correct, but I think I might have found a (very subtle!) problem:

If your patterns were created with the label INSULTS but the textcat.teach recipe specifies the label INSULT, you will not see any suggestions from the patterns (because Prodigy will only look for patterns with the label INSULT). This would also explain the behaviour you're seeing: all the suggestions you get come from the model, and they're pretty random, so it's not actually learning much as you annotate. Pattern matches will show the pattern number (the line number) in the bottom left of the annotation card and also highlight the matched word (a new feature we added after the video was created, to make it more obvious where the match comes from).
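
For example, a pattern line created by terms.to-patterns looks roughly like this (the term here is just an illustration):

{"label": "INSULT", "pattern": [{"lower": "idiot"}]}

The "label" value has to match the --label argument you pass to textcat.teach exactly; otherwise, the pattern is ignored.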

Thanks once again for your reply. I realise that I did actually type INSULT in the singular in all instances on my machine and only made that error in the message above. In other words, the commands I actually entered did not contain the errors above.

A seed term appeared on the annotation card only around 1% of the time; otherwise, it showed text that contained no seed terms at all. I just started the whole process again to make sure there was no error like the one above, and unfortunately I got the same results.

In case it is of any use:

I was experimenting with making a dentistry list this evening. If I run the resulting seed list through the ner.teach recipe

prodigy ner.teach dentistry_ner en_core_web_lg RC_2015-03.bz2 --loader reddit --label DENTISTRY --patterns dentistry_patterns.jsonl

it provides example sentences that use the seed words found in the dentistry patterns file (attached: dentistry_patterns.jsonl (4.3 KB)).

However, when I run the same dentistry patterns file through the text classifier

prodigy textcat.teach dentistry en_core_web_lg RC_2015-03.bz2 --loader reddit --label DENTISTRY --patterns dentistry_patterns.jsonl

the example sentences never contain any of the items in the dentistry patterns, or anything remotely related.

I have uploaded example screenshots from ner.teach and textcat.teach.

Thanks for the report.

The way the seed-based bootstrapping works is that the results from the matcher are interleaved with the predictions from the model. The idea is that at first you'll only see results from the patterns, and the predictions from the model will then gradually come in as the model learns the category from the patterns.
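
For reference, the relevant logic looks roughly like this (simplified from the open-source version of the textcat.teach recipe in the prodigy-recipes repo, assuming nlp, label, patterns and stream are already set up; the exact arguments can differ between Prodigy versions):

from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.util import combine_models

model = TextClassifier(nlp, label)                # model's score for each example
matcher = PatternMatcher(nlp, prior_correct=5., prior_incorrect=5.,
                         label_span=False, label_task=True)
matcher = matcher.from_disk(patterns)             # load the patterns file
predict, update = combine_models(model, matcher)  # interleave both suggestion streams
stream = prefer_uncertain(predict(stream))        # prefer scores closest to 0.5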

This is working correctly in the NER case, because the model doesn't start out predicting entities. And if you do text classification with non-mutually-exclusive classes, the model will also avoid predicting the labels initially.

The problem is that your model has exclusive classes, so the model's initial prediction is a score of 0.5 for each example. This means the initial examples are all from the model, rather than the patterns.

The problem should resolve itself if you click through a few batches of these random examples, rejecting them as incorrect predictions. You could also do an initial annotation session that takes some number of suggestions from the matcher first, before switching over to the combined model. This is actually what we used to do, which is why the version in the video behaves a little differently. However, it's hard to guess what will work well on different problems, so we now try to avoid coding complicated behaviours into the default recipes. Instead, I think it's usually better to provide simpler pieces and let developers construct the desired behaviours themselves.

Prodigy is driven by recipe scripts, which you can either edit or author yourself. You can get a good set of starter recipes from the repo here: https://github.com/explosion/prodigy-recipes . I think you might prefer to add a flag to the textcat.teach recipe that lets you use only the PatternMatcher, rather than combining it with the model. You could then annotate with just the patterns for a while and use that to train an initial model. A rough sketch of what that could look like follows below.
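
This sketch is based on the textcat.teach recipe in that repo; the recipe name and the --patterns-only flag are made up for this example, and the exact signatures may differ in your Prodigy version:

import prodigy
import spacy
from prodigy.components.loaders import Reddit
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.util import combine_models

@prodigy.recipe(
    "textcat.teach-patterns",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model", "positional", None, str),
    source=("Source data to annotate", "positional", None, str),
    label=("Category label to apply", "option", "l", str),
    patterns=("Path to match patterns file", "option", "pt", str),
    patterns_only=("Use only the pattern matcher", "flag", "po", bool),
)
def textcat_teach_patterns(dataset, spacy_model, source, label, patterns,
                           patterns_only=False):
    nlp = spacy.load(spacy_model)
    stream = Reddit(source)
    matcher = PatternMatcher(nlp, prior_correct=5., prior_incorrect=5.,
                             label_span=False, label_task=True)
    matcher = matcher.from_disk(patterns)
    if patterns_only:
        # Stream out pattern matches only - the model stays out of the loop,
        # so every suggestion comes from your seed terms.
        stream = (eg for _score, eg in matcher(stream))
        update = matcher.update
    else:
        # Default behaviour: interleave pattern matches and model predictions.
        model = TextClassifier(nlp, label.split(","))
        predict, update = combine_models(model, matcher)
        stream = prefer_uncertain(predict(stream))
    return {
        "view_id": "classification",
        "dataset": dataset,
        "stream": stream,
        "update": update,
    }

You would then run it with the -F flag pointing at the recipe file, e.g. prodigy textcat.teach-patterns insults en_core_web_lg RC_2010-12.bz2 --label INSULT --patterns insults_patterns.jsonl --patterns-only -F recipe.py, annotate a batch of pattern matches, and drop the flag once you've trained an initial model.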