I’m about to train a Gensim model on millions of Swedish (‘sv’) texts, including bigrams and trigrams, and then use it to initialize a spaCy model so I can use Prodigy to tag my own NER labels and do text classification.
According to your documentation on training a new language model, I first need to cache the words used in the model in a Vocab instance. However, I’ll have bigrams and trigrams of the form e.g. bigram “svenska_institutet” and trigram “bästa_jag_vet”.
I’m using gensim.models.Phrases to create the bigrams and trigrams.
Since it’ll probably take days to train the gensim model on my whole corpus on my laptop, and possibly just as long to run your word_freqs.py recipe, I’m wondering whether I can skip the word frequency step.
Can I load a gensim model and somehow update the vocab on the fly? I suppose the bigrams and trigrams won’t be included, since they’re words joined by an underscore: “w1_w2”. Would that make the model faulty?
I suppose I could tweak your word_freqs.py recipe to also count bigrams and trigrams, using Python’s re module if necessary.
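Something like this is what I had in mind (a stdlib-only sketch, not your actual recipe; conveniently, `\w` already matches underscores, so one pattern covers both ordinary words and the joined n-grams):

```python
import re
from collections import Counter

# "\w" includes "_", so underscore-joined bigrams/trigrams are matched
# as single tokens alongside ordinary words
token_re = re.compile(r"\w+")

def count_freqs(lines):
    """Count token frequencies over an iterable of phrased sentences."""
    counts = Counter()
    for line in lines:
        counts.update(token_re.findall(line.lower()))
    return counts

freqs = count_freqs(["Svenska_institutet är bra", "bästa_jag_vet är bra"])
# freqs["svenska_institutet"] == 1, freqs["är"] == 2
```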
How would you recommend I go about getting the full power of spaCy for NER and text classification?