German short text textcat training - compound splitting?


I am using a spaCy textcat pipeline to train a German short-text prediction model with 2k classes (using exclusive_classes=True). According to the documentation, the ensemble architecture uses a CNN plus an n-gram bag-of-words model.

If I understand correctly, the CNN part uses the tokens' word vectors. Therefore I need the largest German model, de_core_news_lg, to ensure that most German words are covered by vectors, correct?

However, I still have the problem that a lot of important words (my texts are very short, so every word counts) are tokenized as a single token but have no vector. E.g. nlp('produktmanagement').vector or nlp('softwareentwickler').vector delivers an empty vector.

This particularly happens with compound words, which, as we know, are very common in German. nlp('produkt management').vector delivers a vector. In my context, would it make sense to apply compound splitting to the short texts to improve the results? Are there other improvements I could use for German short-text classification with 2k classes?
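To make the compound-splitting idea concrete, here is a minimal greedy splitter sketch in pure Python. The tiny word list is an illustrative stand-in; in practice you would use a large German lexicon or a dedicated library, and the `min_part` parameter and Fugen-s handling are my own simplifying assumptions.

```python
# Illustrative lexicon only -- a real setup would use a large German word list.
KNOWN_WORDS = {"produkt", "management", "software", "entwickler"}

def split_compound(word, lexicon=KNOWN_WORDS, min_part=3):
    """Try to split `word` into two known parts, optionally dropping a
    linking 's' (Fugen-s) at the end of the first part. Returns a list
    of parts, or the word unchanged if no split is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(min_part, len(word) - min_part + 1):
        head, tail = word[:i], word[i:]
        # Also try the head with a trailing linking "s" removed.
        candidates = [head]
        if head.endswith("s"):
            candidates.append(head[:-1])
        for h in candidates:
            if h in lexicon and tail in lexicon:
                return [h, tail]
    return [word]

print(split_compound("produktmanagement"))   # ['produkt', 'management']
print(split_compound("softwareentwickler"))  # ['software', 'entwickler']
```

One could then join the parts with a space before passing the text to nlp(), so that each part can hit the vectors table on its own.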


Update: I tried compound splitting, but it seems to worsen the results.

Hi Simon,

spaCy's model will learn representations for unknown words from the training data, so you don't necessarily need all the words to be present in the word vectors table. If the training data is small, the model might struggle a little. Unfortunately the v2 models are a bit too heavily tuned for English, and they have trouble on languages like German when there's not much training data.

You might experiment a little with the spacy pretrain command, which could improve your results if you have enough unlabelled text. We've been working hard on v3 of spaCy, which will make it much easier to use transformer models and to customize details of the model.
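For reference, a v2 pretrain invocation might look roughly like this. The file paths are placeholders: raw_text.jsonl would contain one {"text": "..."} object per line of unlabelled German text, and the exact checkpoint filename depends on how long pretraining runs.

```shell
# Pretrain the tok2vec layer on raw German text, using the lg vectors
# as the target. Paths below are placeholders, not real files.
python -m spacy pretrain raw_text.jsonl de_core_news_lg ./pretrain_out

# A resulting checkpoint can then seed textcat training via --init-tok2vec:
# python -m spacy train de ./model train.json dev.json \
#     --pipeline textcat --init-tok2vec ./pretrain_out/model99.bin
```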