Hi,
I am using a spaCy textcat pipeline to train a German short-text prediction model with 2k classes (using exclusive_classes=True). According to the documentation https://spacy.io/api/textcategorizer, the ensemble architecture uses a CNN plus an n-gram bag-of-words model.
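
For context, this is roughly how I set the component up (spaCy v2; the labels below are just placeholders for my ~2k classes):

```python
import spacy

nlp = spacy.load("de_core_news_lg")

# Text categorizer with mutually exclusive classes and the "ensemble"
# architecture (CNN + n-gram bag of words).
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "ensemble"},
)
nlp.add_pipe(textcat, last=True)

# Placeholder labels; in my real setup there are roughly 2k classes.
for label in ["CLASS_A", "CLASS_B", "CLASS_C"]:
    textcat.add_label(label)
```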
If I understand correctly, the CNN part uses the tokens' word vectors. Therefore I need the largest German model, de_core_news_lg, to ensure that most German words are covered by vectors, correct?
However, I still have the problem that a lot of important words have no vectors, even though they are single tokens (my texts are very short, so every word counts). E.g. nlp('produktmanagement').vector or nlp('softwareentwickler').vector returns a zero vector.
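
A minimal check along these lines shows the per-token coverage (assuming de_core_news_lg is installed):

```python
import spacy

nlp = spacy.load("de_core_news_lg")

for text in ["produktmanagement", "softwareentwickler", "produkt management"]:
    doc = nlp(text)
    for token in doc:
        # has_vector is False and vector_norm is 0.0 for out-of-vocabulary tokens
        print(token.text, token.has_vector, token.vector_norm)
```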
This happens in particular with compound words, which, as we know, are very common in German. nlp('produkt management').vector, on the other hand, delivers a proper vector. In my context, would it make sense to apply compound splitting to the short texts to improve the results (see the sketch below)? Are there other improvements I could use for German short-text classification with 2k classes?
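
What I have in mind is a preprocessing step roughly like this sketch; split_compound here is only a naive stand-in for a proper German compound splitter, not an existing spaCy function:

```python
import spacy

nlp = spacy.load("de_core_news_lg")

def split_compound(word, vocab):
    # Naive illustration only: try every split point and keep the first one
    # where both halves have a vector; otherwise keep the word as-is.
    # A real German compound splitter (dictionary/frequency based) would do better.
    for i in range(3, len(word) - 2):
        head, tail = word[:i], word[i:]
        if vocab.has_vector(head) and vocab.has_vector(tail):
            return [head, tail]
    return [word]

def preprocess(text):
    # Only split tokens that have no vector of their own.
    parts = []
    for token in nlp.make_doc(text):
        if token.has_vector:
            parts.append(token.text)
        else:
            parts.extend(split_compound(token.text, nlp.vocab))
    return " ".join(parts)

print(preprocess("softwareentwickler produktmanagement"))
```

The idea would be to train and predict on preprocess(text) instead of the raw text.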
Best,
S