I’m about to train a Gensim model on millions of Swedish (‘sv’) texts, including bigrams and trigrams, and then use it to initialize a spaCy model so I can use Prodigy to tag my own NER labels and also do text classification.
According to your documentation on training a new language model, I first need to cache the words used in the model in a Vocab instance. However, I’ll have bigrams and trigrams in the model, e.g. the bigram “svenska_institutet” and the trigram “bästa_jag_vet”.
Since it’ll probably take days to train the Gensim model on my whole corpus on my laptop, and possibly just as long to run your word_freqs.py recipe, I’m wondering whether I can skip the word frequency step.
Can I load a Gensim model and somehow get the vocab updated on the fly? I suppose the bigrams and trigrams won’t be included, since they’re words joined by underscores (“w1_w2”). Would that make the model faulty?
I suppose I could tweak your word_freqs.py recipe to also count bigrams and trigrams with Python’s re module if necessary, roughly as in the sketch below.
How would you recommend I go about getting the full power of spaCy for NER and text classification?
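Something like this is what I had in mind (just a rough sketch, not tested against your recipe):

```python
# Count unigrams plus underscore-joined bigrams/trigrams with a simple regex tokenizer
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+", re.UNICODE)
counts = Counter()

def count_ngrams(line):
    tokens = TOKEN_RE.findall(line.lower())
    counts.update(tokens)
    counts.update("_".join(pair) for pair in zip(tokens, tokens[1:]))
    counts.update("_".join(tri) for tri in zip(tokens, tokens[1:], tokens[2:]))
```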
Training the Gensim model with bigrams and trigrams can be very useful for lots of purposes, but depending on what you’re doing, it might not do what you expect when you use that in combination with spaCy.
spaCy’s NER, tagger, text classifier etc look up the word vector for each token in the Doc object. So in order to use bigrams and trigrams, you’d need to make sure the tokenization matches. Basically, you can decide to merge certain tokens so that you get a longer token, and then you can look up that token in the word vectors as normal — you would just have some words with spaces in them.
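Here’s a minimal sketch of what that could look like in current spaCy versions. The 300-dimensional zero vector is just a placeholder for a row you’d import from your Gensim model, and in a real pipeline you’d find the spans to merge with a Matcher or PhraseMatcher rather than hard-coding the indices:

```python
import numpy
import spacy

nlp = spacy.blank("sv")

# Placeholder vector: in practice this row would come from your Gensim model's
# entry for "svenska_institutet", stored here under the space-separated key
nlp.vocab.set_vector("svenska institutet", numpy.zeros((300,), dtype="f"))

doc = nlp("Jag besökte svenska institutet i Paris.")
with doc.retokenize() as retokenizer:
    # In a real pipeline you'd find these spans with a Matcher or PhraseMatcher
    retokenizer.merge(doc[2:4])  # "svenska institutet" is now a single token

print([(token.text, token.has_vector) for token in doc])
```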
If you’re not merging the bigrams and trigrams in a particularly principled way, this isn’t that useful as a pre-process for NER, and definitely not for parsing and tagging. Where it can really help is in building terminology lists: having vectors for longer phrases makes the terms.teach recipe much more useful, leading to better patterns files for rule-based NER matching, which helps with annotation (see the example below).
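For instance, a patterns file is just JSONL with one entry per term or phrase; Prodigy accepts both token-based patterns and plain strings. The “ORG” label and the file name here are only placeholders for whatever you’re annotating:

```python
# Write hypothetical patterns.jsonl entries for a multi-word term
import json

patterns = [
    {"label": "ORG", "pattern": [{"lower": "svenska"}, {"lower": "institutet"}]},
    {"label": "ORG", "pattern": "Svenska institutet"},
]
with open("sv_patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```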
Finally, a word of advice: laptops really aren’t designed to run multi-day compute tasks, and they tend to really suck at this. They always want to go to sleep, you have to leave them plugged in, heat can become a problem, etc. If it crashes you have to restart the job. I would recommend setting up a remote machine, which you would log into, run tmux or screen, and start the task. You may be able to get a couple of hundred in free Google Compute Engine credits. Try Hetzner: https://www.hetzner.com/cloud . You probably don’t need to spend more than 15 a month, and it will save you a lot of trouble.
By the way, another strategy for training the models you’re interested in more quickly might be to count term cooccurrences using Gensim’s Bounter library: https://github.com/RaRe-Technologies/bounter . Because the counts are approximate, you can do this with a fixed memory budget, regardless of how much data you have. You would then use the counts to train GloVe models.
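A rough sketch of that idea, assuming a plain-text corpus file and simple whitespace tokenization (both just placeholders here):

```python
# Approximate counts within a fixed memory budget (here ~1 GB)
from bounter import bounter

counts = bounter(size_mb=1024)
with open("corpus_sv.txt", encoding="utf8") as f:
    for line in f:
        tokens = line.lower().split()
        counts.update(tokens)
        # cooccurrences of adjacent terms could be counted the same way, e.g. as "w1_w2" keys
        counts.update("_".join(pair) for pair in zip(tokens, tokens[1:]))

print(counts["svenska"])  # approximate frequency
```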
Thanks for the tips on Gensim’s Bounter library, really useful!
OK, so let’s say I count freqs for single tokens only and train a word2vec model in Gensim on single tokens only. Would I be able to merge certain tokens afterwards and still have the single tokens left in the Vocab in spaCy? Is that what you mean? How would I do that using spaCy?
If you train the word2vec model on single tokens, then when you merge the tokens later, you won’t be able to find your merged tokens in the vectors.
You can think of the vectors, frequencies etc as just dictionaries. You’re going to do something like: tokens = split_text(text); vectors = [lookup[key] for key in tokens]. So you need the text to be split the same way before training the word vectors as you’ll use in your spaCy pipeline.
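To make the dictionary analogy concrete (toy vectors, obviously):

```python
# Toy "vector table": the keys are exactly the strings you trained on
vectors = {
    "svenska": [0.1, 0.2],
    "institutet": [0.3, 0.4],
    "svenska_institutet": [0.5, 0.6],  # Gensim phrase key, joined with "_"
}

tokens = ["svenska", "institutet"]       # how spaCy would split the text
print([w in vectors for w in tokens])    # [True, True] -> single-token keys are found
print("svenska institutet" in vectors)   # False -> the merged token (with a space) has no entry
```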