I’m trying to build a custom sense2vec model using spaCy and the code available on GitHub. I’m using a subset of the original source you guys used (the Reddit data dumps). I’ve already pre-processed those and written everything out to a single .xz file containing just the subreddits I want to include in the model space. Ultimately, I’d like to be able to layer on an additional 100 million domain-specific tweets I have as well.
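For context, my pre-processing step follows roughly this pattern (simplified sketch; the subreddit list is a placeholder and the `subreddit`/`body` field names are my reading of the dump format, not necessarily exactly what I ran):

```python
import json
import lzma

KEEP = {"MachineLearning", "LanguageTechnology"}  # placeholder subreddit list, not my real one

def filter_dump(in_path, out_path):
    """Keep only comments from the target subreddits and append their bodies
    to a single .xz file, one document per line."""
    with lzma.open(in_path, mode="rt", encoding="utf-8") as fin, \
         lzma.open(out_path, mode="at", encoding="utf-8") as fout:
        for line in fin:
            comment = json.loads(line)
            if comment.get("subreddit") in KEEP and comment.get("body"):
                fout.write(comment["body"].replace("\n", " ") + "\n")
```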
I’ve modified the merge_text.py script from the repo to fix some incompatibility/regression issues with the newest version of spaCy, as well as to read .xz compressed archives. Once that finishes (which is taking a horridly long time), I can then use the output files (which appear to be a bunch of batch-numbered .txt files) as input for the train_word2vec.py script.
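In case it matters for the answers, the .xz change is basically just streaming lines out of the archive instead of reading plain-text files, roughly like this (simplified sketch, not my exact code):

```python
import lzma

def iter_texts(archive_path):
    # Stream documents straight out of the .xz archive, one per line,
    # so nothing has to be decompressed to disk first.
    with lzma.open(archive_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line
```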
Here are my questions:

- Am I on the right track here?
- Once I have the custom vector model, can I merge that in and leverage the existing POS and/or dependency models?
- My ultimate goal is to then enhance/add new entities to an NER model as well as multiple new textcat models, essentially a full set of domain-specific optimized models. Is this possible? If so, what’s the high-level workflow for doing that?
- Final question… I specified -n 6 on the merge_text.py script, but it seems to be taking up significantly more than 6 threads/cores. This machine is a fairly speedy Ryzen 1950X and it’s maxing out all 16 cores at the moment and has been running for almost 24 hours. I left most of the partition settings at their defaults, so does this mean it’s basically bundling a single text file for each 200,000 docs or sentences? (See the sketch after this list for the pattern I’m referring to.)
I’m a fairly good self-learner, but I’ve been out of NLP for a long while (my original experience was manually creating TF-IDF matrices in old-school SAS), so I’m still getting back up to speed on all the latest stuff in spaCy and Prodigy (I love both, BTW).
Any suggestions or answers would be incredibly helpful!