I’m currently working on a project to develop an insult classifier, similar to what is accomplished in the Reddit insult classifier tutorial, except that my classifier will be trained on employer reviews in German. I wanted to share my experience and see if someone might have advice on how I could get a little more traction in my project.
I started by doing some typical preprocessing steps on my corpus, such as:
• Normalizing umlauts (e.g. ö to oe)
• Lowercasing the strings
• Stripping out excess whitespace
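For reference, a minimal sketch of these preprocessing steps in plain Python (the replacement table and function name are just illustrative):

```python
import re

# Illustrative umlaut/eszett replacement table
UMLAUT_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def preprocess(text: str) -> str:
    """Lowercase, normalize umlauts, and collapse whitespace."""
    text = text.lower()  # lowercase first so Ä/Ö/Ü are covered too
    for umlaut, replacement in UMLAUT_MAP.items():
        text = text.replace(umlaut, replacement)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("  Völlig  ÜBERZOGENE   Kritik "))  # -> "voellig ueberzogene kritik"
```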
I also trained my own word embeddings using gensim’s Word2Vec and FastText implementations, although the corpus contains only a few million written responses.
I also generated a relatively large list of seed terms to help my model get started, first using Prodigy to find a set of ~300 seeds, which were then transformed into patterns.
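A minimal sketch of that seeds-to-patterns step (the label name and seed terms are placeholders; my real list has ~300 entries):

```python
import json

# Placeholder seed terms
seeds = ["idiot", "unfaehig", "katastrophe"]

# Convert each seed into a one-token match pattern, written as JSONL
patterns = [{"label": "INSULT", "pattern": [{"lower": term}]} for term in seeds]

with open("insult_patterns.jsonl", "w", encoding="utf-8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")

print(patterns[0])
```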
One problem I noticed when I first started annotating is that insults do not appear very often in the corpus, so I end up with a very high reject/accept ratio. I’ve already seen a few other posts that mention this cold-start problem and the importance of having roughly balanced classes, at least initially. Since insults appear in maybe 1% of the entire corpus, I tried simply skipping a high percentage of the reject cases (where no insult was present), which was time-consuming.
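Concretely, what I tried looks roughly like this: keep every example that hits a seed term and only a random fraction of the rest (the matcher and keep fraction here are simplified stand-ins):

```python
import random

seeds = {"idiot", "katastrophe"}  # simplified seed set

def subsample_stream(examples, keep_fraction=0.1, rng=random.Random(0)):
    """Yield all likely positives, but only a fraction of probable rejects."""
    for text in examples:
        tokens = set(text.lower().split())
        if tokens & seeds:
            yield text  # possible insult: always keep
        elif rng.random() < keep_fraction:
            yield text  # keep a small random sample of likely rejects

docs = [
    "Der Chef ist ein Idiot",  # hits a seed, always kept
    "Nettes Team",
    "Gute Kantine",
    "Langweilige Aufgaben",
]
kept = list(subsample_stream(docs, keep_fraction=0.1))
print(kept)  # the seed match survives; most non-matches are dropped
```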
• With a relatively small corpus, would you recommend training my own word embeddings, or sticking with pretrained embeddings from Common Crawl etc.?
• Can you recommend other strategies to identify insults when they appear very seldom? Right now I’ve basically just tried finding examples using exact string matches, but I’m concerned this is a bit fraught because I’m focusing on a narrow type of positive example.
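One thing I’ve been experimenting with to get beyond exact matches is cheap fuzzy matching with the stdlib difflib, which at least catches misspelled variants (the threshold and helper name are ad hoc):

```python
from difflib import SequenceMatcher

def fuzzy_hits(text, seeds, threshold=0.85):
    """Return tokens that approximately match any seed term."""
    hits = []
    for token in text.lower().split():
        for seed in seeds:
            if SequenceMatcher(None, token, seed).ratio() >= threshold:
                hits.append(token)
                break
    return hits

# Catches the misspelling "idiiot", which an exact string match would miss
print(fuzzy_hits("Der Chef ist ein idiiot", ["idiot"]))
```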
• Perhaps related: I’ve noticed that even after my model can recognize some insults, when I use the prefer_certain functionality it seems to work on batches of 10 examples. Is there any way to simply increase the batch size, or to write similar functionality myself? I’m a bit confused about where or how I might modify the stream.
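To make that question concrete, this is the kind of wrapper I have in mind: buffer a larger window of scored examples and yield only the highest-scoring ones (the window size, function name, and (score, example) stream format are my assumptions, not Prodigy’s actual internals):

```python
from heapq import nlargest

def prefer_certain_batched(scored_stream, window=50, top_k=10):
    """Buffer `window` (score, example) pairs, then yield the top_k by score."""
    buffer = []
    for score, example in scored_stream:
        buffer.append((score, example))
        if len(buffer) >= window:
            for score, example in nlargest(top_k, buffer, key=lambda p: p[0]):
                yield example
            buffer = []
    # Flush whatever is left at the end of the stream
    for score, example in nlargest(top_k, buffer, key=lambda p: p[0]):
        yield example

scored = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.95, "d")]
print(list(prefer_certain_batched(scored, window=3, top_k=2)))  # -> ['b', 'c', 'd']
```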
• The text I’d like to annotate varies in length from just a few words to multiple paragraphs. Can I assume that splitting examples into sentence-level annotations will improve the classifier’s performance?
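For context, the split I have in mind looks something like this (a naive regex splitter purely for illustration; in practice I’d use a proper sentence segmenter such as spaCy’s sentencizer):

```python
import re

def to_sentence_tasks(examples):
    """Split each review into one annotation task per sentence (naive splitter)."""
    tasks = []
    for review in examples:
        # Naive sentence boundary: ., !, or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", review.strip()):
            if sent:
                tasks.append({"text": sent})
    return tasks

reviews = ["Gutes Gehalt. Der Chef ist ein Idiot!", "Nettes Team"]
print(to_sentence_tasks(reviews))
```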
Thanks for the support! I’m really impressed and grateful that you guys have developed such awesome NLP tools.