I’m currently working on a project to develop an insult classifier similar to what is being accomplished in the Reddit insult classifier tutorial. However, my classifier will be trained on employer reviews in German. I wanted to share my experience and see if someone might have advice on how I could get a little more traction with my project.
I started by doing some typical preprocessing steps on my corpus (see the sketch after this list), such as:
• Normalizing umlauts, e.g. ö to oe
• Lowercasing the strings
• Stripping out excess whitespace.
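Concretely, the preprocessing is roughly this (just a sketch of those three steps, not my exact code):

```python
import re

def preprocess(text: str) -> str:
    """Normalize umlauts, lowercase, and collapse whitespace."""
    text = text.lower()
    # Replace German umlauts and ß with their ASCII transliterations
    for umlaut, ascii_form in {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}.items():
        text = text.replace(umlaut, ascii_form)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Der  Chef ist   völlig ÜBERFORDERT"))
# -> "der chef ist voellig ueberfordert"
```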
I also trained my own word embeddings using gensim’s Word2Vec and FastText implementations, although the corpus only contains a few million written responses.
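Roughly, the embedding training looked like this (the file path and hyperparameters below are placeholders, not my exact settings):

```python
from gensim.models import FastText

# One preprocessed review per line; tokenization here is just a whitespace split
corpus = [line.split() for line in open("reviews_preprocessed.txt", encoding="utf8")]

model = FastText(
    sentences=corpus,
    vector_size=300,  # dimensionality of the embeddings
    window=5,         # context window size
    min_count=5,      # drop very rare tokens
    epochs=10,
)
model.save("fasttext_reviews.model")
```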
I also generated a relatively large list of seed terms to help my model get started: I first used Prodigy to find a set of ~300 seeds, which were then transformed into patterns.
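In case it helps anyone, converting the seeds into patterns is basically just writing one token-based match pattern per term, something like this sketch (the seed terms and label name are only examples):

```python
import json

# A couple of illustrative seed terms; the real list is much longer
seeds = ["idiot", "inkompetent", "katastrophe"]

with open("insult_patterns.jsonl", "w", encoding="utf8") as f:
    for term in seeds:
        # One pattern per line, in the JSONL format Prodigy expects
        pattern = {"label": "INSULT", "pattern": [{"lower": term}]}
        f.write(json.dumps(pattern, ensure_ascii=False) + "\n")
```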
One problem I noticed when I first started annotating is that insults do not appear very often in the corpus, so I end up with a very high reject/accept ratio. I’ve already seen a few other posts that mention this cold-start problem and the importance of having roughly balanced classes, at least initially. Since insults appear in maybe 1% of the entire corpus, I tried to skip a high percentage of the reject cases (where no insult was present), which was time consuming.
Questions:
• With a relatively small corpus, would you recommend training my own word embeddings, or sticking with pretrained word embeddings from Common Crawl etc.?
• Can you recommend other strategies for identifying insults when they appear very seldom? Right now I’ve basically just tried finding examples using exact string matches, but I’m concerned this is a bit fraught because I’m then focusing on a narrow type of positive example.
• Perhaps related: I’ve noticed that even after my model can recognize some insults, the prefer_certain functionality seems to work on batches of 10 examples. Is there any way to simply increase the batch size, or write similar functionality myself? I’m a bit confused about where or how I might modify the stream.
• The texts I’d like to annotate vary in length from just a few words to multiple paragraphs. Can I assume that splitting examples into sentence-level annotations will improve the classifier’s performance?
Thanks for the support! I’m really impressed and grateful that you guys have developed such awesome NLP tools.
Thanks for the clear context. I hope we can help, although it's true that finding rare items is often very difficult for a classifier.
I think the Common Crawl vectors, e.g. the ones distributed by FastText, are probably fine. I wouldn't immediately see a reason to change from them. Note that I don't think the FastText vectors use the pre-processing you describe, e.g. replacing umlauts. Actually, I don't really see why replacing umlauts should be necessary in general: everything should work fine with UTF-8 text.
I think at the start you probably need an information-retrieval approach, where you pretty much just get a terminology list and search for the terms. If you can get a list that gives you about 50% precision, then hopefully you can bootstrap from there. Get documents with your search terms, mark accept/reject, and then train a model on the dataset. Then run the model over your texts, and have a look at the high-scoring predictions. You probably want to make separate scripts to do this, rather than working directly in Prodigy. This lets you work with larger batches of data, and lets you iterate on the script more quickly.
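As a rough sketch of what such a script could look like (the terms, model path and label name below are placeholders, not anything Prodigy ships with):

```python
import spacy

TERMS = {"idiot", "inkompetent", "katastrophe"}  # your seed terminology list
nlp = spacy.load("insult_model")  # text classifier trained on the accept/reject annotations

def candidate_texts(path):
    # Cheap information-retrieval step: only keep texts containing a seed term
    for line in open(path, encoding="utf8"):
        text = line.strip()
        if any(term in text.lower() for term in TERMS):
            yield text

# Score the candidates and inspect the highest-scoring predictions
scored = []
for doc in nlp.pipe(candidate_texts("reviews.txt"), batch_size=256):
    scored.append((doc.cats.get("INSULT", 0.0), doc.text))

for score, text in sorted(scored, reverse=True)[:100]:
    print(f"{score:.3f}\t{text}")
```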
You can set the batch size in your prodigy.json file. But, as above, you might prefer to do the filtering and sorting in a separate process, rather than in the "online" way Prodigy does it. This lets you run the model over much more data. You can queue up streams of examples for Prodigy by simply writing a generator function and passing it as the "stream" entry in the components dict your recipe returns. See the example recipes at https://github.com/explosion/prodigy-recipes.
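A minimal sketch of such a recipe, assuming a separate script has already written filtered and sorted examples to a JSONL file (the recipe name and label are illustrative):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("insult.manual")
def insult_manual(dataset, source):
    def stream():
        # `source` is JSONL produced by your own filtering/scoring script,
        # already sorted however you like (e.g. by model score, highest first)
        for eg in JSONL(source):
            yield {"text": eg["text"], "label": "INSULT"}

    return {
        "dataset": dataset,   # dataset the annotations are saved to
        "stream": stream(),   # any generator of example dicts works here
        "view_id": "classification",
    }
```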
I think annotating sentence-by-sentence is probably the way to go, yes. You can always get a prediction over a longer text by averaging the predictions of the sentences, or perhaps by taking the max of the predictions: it probably makes sense to consider a text insulting if it contains at least one insulting sentence.
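For instance, something like this sketch of the max-over-sentences idea, assuming a spaCy v3 pipeline (the model name and label are placeholders, and you'd want to batch this properly in practice):

```python
import spacy

nlp = spacy.load("insult_model")  # sentence-level text classifier
if "parser" not in nlp.pipe_names and "senter" not in nlp.pipe_names:
    # The textcat doesn't set sentence boundaries, so add a rule-based sentencizer
    nlp.add_pipe("sentencizer", first=True)

def document_score(text: str) -> float:
    """Score a longer text as the max over its sentence-level predictions."""
    doc = nlp(text)
    scores = [nlp(sent.text).cats.get("INSULT", 0.0) for sent in doc.sents]
    return max(scores, default=0.0)
```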
Hi, I wanted to report back in case someone else runs into a similar question.
After making an extensive list of keywords, I managed to find a decent starting set of texts to annotate. After annotating a few thousand examples, I’m seeing a good improvement in performance.
One question I had: initially, the dataset was still somewhat imbalanced, with approximately 33% of the examples being accepted insults and the other 67% rejected. However, when I trained the model on it, I got a noticeably worse recall score.
Could the moderate class imbalance be hurting the overall performance of the model, or should I be reviewing the quality of my annotations?
About the class imbalance: The main thing to keep an eye on is whether your train/dev/test splits end up with divergent class balances. If you train on data with a 50/50 balance and test on data with a 90/10 balance, you’ll almost certainly see bad results.
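If you're making your own splits outside Prodigy, a stratified split is an easy way to keep the class balance consistent; here's a sketch assuming the annotations were exported with db-out (the paths are placeholders):

```python
import json
from sklearn.model_selection import train_test_split

# Annotations exported from Prodigy, e.g. via `prodigy db-out my_dataset > annotations.jsonl`
examples = [json.loads(line) for line in open("annotations.jsonl", encoding="utf8")]
examples = [eg for eg in examples if eg["answer"] in ("accept", "reject")]
texts = [eg["text"] for eg in examples]
answers = [eg["answer"] for eg in examples]

# stratify keeps the accept/reject ratio roughly the same in train and dev
train_texts, dev_texts, train_answers, dev_answers = train_test_split(
    texts, answers, test_size=0.2, stratify=answers, random_state=0
)
```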
Now that you’re getting encouraging accuracy, I would prioritise annotating evaluation data you can reuse across the lifespan of your project. Make sure you’re not using active learning when you’re annotating this: you want to draw a random set of examples, instead of using the model as part of the example selection. You should also make sure the evaluation texts aren’t in the training data.
Once you have a randomly drawn evaluation set, it should be a bit easier to make directly comparable experiments, and reason about what to do next.
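A rough sketch of drawing that random evaluation sample, excluding texts that are already in the training data (the paths and the sample size are placeholders):

```python
import json
import random

# Texts that already appear in the training annotations
train_texts = {json.loads(line)["text"] for line in open("train_annotations.jsonl", encoding="utf8")}
corpus = [line.strip() for line in open("reviews.txt", encoding="utf8")]

# Draw a reproducible random sample of unseen texts to annotate as the evaluation set
candidates = [text for text in corpus if text not in train_texts]
random.seed(0)
sample = random.sample(candidates, min(2000, len(candidates)))

with open("eval_candidates.jsonl", "w", encoding="utf8") as f:
    for text in sample:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```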