German sense2vec model

Hi, for my project (briefly described in my previous question here), I am wondering whether it would make more sense to train a vector space model specifically on literary texts. This would ideally increase the accuracy of training for NER and sentiment analysis, right? Would it also be the way to go for using transformer technology? (I am obviously quite inexperienced and not sure what would be best.)
If a custom sense2vec is the best option, and having obtained the whole collection of the German Project Gutenberg (16,000 books in HTML format), could I use this to train a "literary" sense2vec model? How?

Also, I have noticed that fastText has a model trained for German; am I right in understanding that this has been trained on Wikipedia texts? How would the use of a vector space model make my investigation better in comparison to the provided spaCy large German model?

The fastText vectors are trained on Common Crawl and Wikipedia; see the description at the top of Word vectors for 157 languages · fastText. The current (v3.1.0) de_core_news_md/lg vectors are trained on OSCAR (also Common Crawl) and Wikipedia, so I would expect them to be fairly similar overall.

Some differences:

  • The fastText vectors have different preprocessing and tokenization.
  • The de_core vectors use exactly the same tokenization as de/German in spaCy.
  • The fastText vectors are trained on a larger amount of data than the de_core vectors, so the performance may be better despite a few more OOV tokens due to tokenization differences.
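
If it helps to see how this plays out on your own texts, you can check vector coverage directly in spaCy with something like the following (the example sentence is just a placeholder):

```python
import spacy

# load the large German pipeline with its bundled vectors
nlp = spacy.load("de_core_news_lg")

doc = nlp("Der alte Graf wanderte schweigend durch die Bibliothek.")
for token in doc:
    # has_vector / is_oov show whether each token is covered by the vectors table
    print(token.text, token.has_vector, token.is_oov)
```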

You may get better performance by training on your custom texts, but you'd have to try it out for your training data and downstream task. I suspect that Project Gutenberg alone is not going to be enough training data for good fastText vectors (but I really don't know, and it really depends on the downstream task!). You should definitely try it out and compare, and you can potentially combine Project Gutenberg with other sources.
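
If you do want to try it, a rough sketch of the workflow (assuming you've already extracted plain text from the HTML files; the file and directory names are just placeholders):

```
# train fastText vectors on your plain-text corpus, one paragraph or sentence per line
fasttext skipgram -input gutenberg_corpus.txt -output literary_vectors -dim 300 -minCount 5

# convert the resulting .vec file into a spaCy pipeline directory containing those vectors
python -m spacy init vectors de literary_vectors.vec ./de_literary_vectors
```

You can then point the initialize.vectors setting in your training config at ./de_literary_vectors and compare the results against the stock de_core_news_lg vectors on your task.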

Great, many thanks! And what about this severinsimmler/literary-german-bert · Hugging Face? Would it be possible to use it as a language model within a prodigy recipe?

You can try it out (with prodigy v1.11), but I have no idea if that particular model will be useful.

I think you can get started with prodigy train if you specify a custom config.cfg that contains transformer+ner and the right transformer model. Use spacy init config -p ner -l de -G config.cfg and then edit the settings in [components.transformer.model].
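
For reference, the relevant block in the generated config should look roughly like this once you've swapped in the Hugging Face model name (the exact @architectures version string depends on your spacy-transformers version, so keep whatever init config generated):

```
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "severinsimmler/literary-german-bert"
tokenizer_config = {"use_fast": true}
```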

I haven't actually tried this myself, though, because I don't have a good prodigy dataset to train from for testing. It worked fine with a similar approach using just spacy train instead of prodigy train, and then using ner.teach with the resulting model.
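
In case it's useful, the commands I'd expect to work are along these lines (dataset names, paths and labels are placeholders):

```
# train from your annotated Prodigy dataset with the custom transformer config
prodigy train ./literary_ner --ner my_ner_dataset --config config.cfg

# then annotate more examples with the trained pipeline
prodigy ner.teach my_ner_dataset ./literary_ner/model-best ./texts.jsonl --label PER,ORG
```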