German short text textcat training - compound splitting?

Hi,

I am using a spaCy textcat pipeline to train a German short-text prediction model with 2k classes (using exclusive_classes=True). According to the documentation (https://spacy.io/api/textcategorizer), the ensemble architecture uses a CNN + n-gram bag-of-words model.
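For context, the setup looks roughly like this (just a sketch; my_labels stands in for my actual list of ~2k class names):

```python
import spacy

# spaCy v2.x with the large German model
nlp = spacy.load("de_core_news_lg")

# Exclusive-classes ensemble textcat (CNN + n-gram bag-of-words)
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "ensemble"},
)
nlp.add_pipe(textcat, last=True)

# Every one of the ~2k labels has to be registered before training
for label in my_labels:
    textcat.add_label(label)
```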

If I understand it correctly, the CNN part uses the word vectors of the tokens. Therefore I need the largest German model, de_core_news_lg, to ensure that most German words are covered by token vectors, correct?

However, I still have the problem that a lot of important words (my texts are very short, so every word counts) consist of a single token but have no vector. E.g. nlp('produktmanagement').vector or nlp('softwareentwickler').vector returns an empty vector.
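A quick way to see this (sketch; the all-zero vector is what spaCy returns for words that are missing from the vectors table):

```python
# Check whether the tokens actually have entries in the vectors table
for word in ("produktmanagement", "softwareentwickler", "produkt", "management"):
    token = nlp(word)[0]
    print(word, token.has_vector, token.vector_norm)
# Words not in the vectors table print has_vector=False and vector_norm=0.0
```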

This happens particularly with compound words, which, as we know, are very common in German. nlp('produkt management').vector does deliver a vector. In my context, would it make sense to apply compound splitting to the short texts to improve the results? Are there other improvements I could use for German short-text classification with 2k classes?
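What I have in mind is a preprocessing step along these lines (a sketch only; split_compound is a placeholder for whatever compound splitter I end up using):

```python
def split_oov_compounds(text, nlp, split_compound):
    """Replace tokens that have no vector by their compound parts,
    but only if all parts themselves have vectors.
    split_compound(word) is assumed to return a list of parts,
    or an empty list if the word cannot be split."""
    out = []
    for token in nlp(text):
        parts = split_compound(token.text)
        if not token.has_vector and parts and all(
            nlp.vocab.has_vector(part) for part in parts
        ):
            out.extend(parts)
        else:
            out.append(token.text)
    return " ".join(out)
```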

Best,
S

Update: I tried https://github.com/dtuggener/CharSplit, but it seems to worsen the results.

Hi Simon,

spaCy's model will learn representations for unknown words from the training data, so you don't necessarily need to have seen all the words in the word vectors table. If the training data is small, the model might struggle a little. Unfortunately, the v2 models are a bit too heavily tuned for English, and they have trouble on languages like German when there's not much training data.

You might experiment a little with the spacy pretrain command, which could improve your results if you have enough unlabelled text. We've been working hard on v3 of spaCy, which will make it much easier to use transformer models, and also to customize the details of the model.
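Roughly, that could look like this (a sketch only; the paths and file names are placeholders, and the tok2vec loading follows the pattern used in spaCy's v2 train_textcat example):

```python
# 1) Pretrain on unlabelled text, e.g.:
#      python -m spacy pretrain raw_texts.jsonl de_core_news_lg ./pretrain_output
#    raw_texts.jsonl: one {"text": "..."} object per line.

# 2) In the training script, after nlp.begin_training(), initialise the
#    textcat's tok2vec layer from the pretrained weights:
from pathlib import Path

init_tok2vec = Path("./pretrain_output/model999.bin")  # placeholder filename
with init_tok2vec.open("rb") as file_:
    textcat.model.tok2vec.from_bytes(file_.read())
```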