Workflow re: Custom Sense2Vec on New Data

I’m trying to build a custom sense2vec model using spaCy and the code available on GitHub. I am using a subset of the original source you guys used (the Reddit data dumps). I’ve pre-processed those already and output everything into a single .xz file with just the subreddits that I want to include in the model space. Ultimately, I would also like to be able to layer on an additional 100 million domain-specific tweets I have.

I’ve modified the merge_text.py script from GitHub to fix some incompatibilities/regression issues with the newest version of spaCy, as well as to read .xz compressed archives. Once that finishes (which is taking a horridly long time), I can then use the output files (which appear to be a bunch of batch-numbered .txt files) as input for the train_word2vec.py script.

Here are my questions:

  • Am I on the right track here?

  • Once I have the custom vector model, can I merge that and leverage existing POS and/or dependency models?

  • My ultimate goal is to then enhance/add new entities to an NER model as well as multiple new textcat models, essentially a full-set of domain-specific optimized models. Is this possible? If so, what’s the high-level workflow for doing that?

  • Final question… I specified -n 6 on the merge_text.py script, but it seems to be taking up significantly more than 6 threads/cores. This machine is a fairly speedy Ryzen 1950X, and it’s maxing out all 16 cores at the moment and has been running for almost 24 hours. I left most of the partition settings at the default, so does this mean it’s basically bundling a single text file for every 200,000 docs or sentences?

I’m a fairly good self-learner, but I’ve been out of NLP for a long while (my original experience was manually creating TF-IDF matrices in old-school SAS), so I’m still getting back up to speed on all the latest stuff in spaCy and Prodigy (I love both, BTW).

Any suggestions or answers would be incredibly helpful.

I'm sorry that code's so unmaintained. We do have a prodigy terms.train-vectors command, which does the phrase-merging in the same step. However, doing everything in the same script isn't necessarily a virtue.

For a large-scale job, I think what you're doing is best: work out your ETL pipeline, make each process relatively small so you can deal with failures, and orchestrate the work with whatever tooling you're already using for your other work.

How many words per second are you seeing per process -- and is each process burning up tonnes of threads? If you're running this on cloud computing, by far the cheapest way is to keep the VMs fairly small. On the Google Compute Engine n1-standard-1 machine type, you should see something like 5000 words per second. If you're seeing much less than that, it means you're using a BLAS library that's not compiled well for the target machine. Installing numpy via conda before installing spaCy is a simple solution. spaCy 2.1 is a fair bit faster, and will make this much easier and more transparent.

Yes. I think automating big tasks is a pain for everyone. I'm told Kubernetes solves this problem well, but introduces interesting new Kubernetes-specific problems instead. I'm not yet sure whether this is a net win.

Possibly not, depending on what you mean by that. You can't very easily use the Sense2vec-powered vectors as features in the parser or tagger. The parser and tagger assume that the input vectors are static, and refer to the original token sequence. You could construct a pipeline that did use the sense2vec vectors, but you'd have to do a bit more work.

You should be able to use the sense2vec vectors within spaCy as a semantic model, though. What you would do is add a pipeline component that sets custom getter functions in the doc.user_hooks dictionary, specifically doc.user_hooks['vector'], doc.user_span_hooks['vector'] and doc.user_token_hooks['vector'] (and the same for 'similarity'). This way, you can customize how the vector is assigned and how the similarity calculation is performed.
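For example, here's a rough sketch of what such a component could look like (spaCy v2 API; s2v_lookup is just a placeholder for however you query your sense2vec table, not an existing function):

import numpy
import spacy

def make_sense2vec_hooks(s2v_lookup):
    # s2v_lookup: your own callable mapping a "text|POS" key to a vector
    def token_vector(token):
        return s2v_lookup(token.text.replace(" ", "_") + "|" + token.pos_)

    def doclike_vector(doclike):
        # average the token vectors for spans and whole docs
        return numpy.mean([token_vector(t) for t in doclike], axis=0)

    def similarity(obj1, obj2):
        v1, v2 = obj1.vector, obj2.vector
        return float(numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2)))

    def component(doc):
        doc.user_token_hooks["vector"] = token_vector
        doc.user_span_hooks["vector"] = doclike_vector
        doc.user_hooks["vector"] = doclike_vector
        for hooks in (doc.user_hooks, doc.user_span_hooks, doc.user_token_hooks):
            hooks["similarity"] = similarity
        return doc

    return component

# dummy usage – swap the lookup for your real sense2vec vector table
table = {}
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(make_sense2vec_hooks(lambda key: table.get(key, numpy.zeros(128, dtype="f"))), last=True)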

You might also want to add custom attributes using the "underscore attributes", which you can set via e.g. Token.set_extension('pos_vector', getter=my_lookup_function). This will give you an attribute token._.pos_vector. This can be useful if you want other ways of accessing the sense2vec data from the token/POS/span objects.
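For instance (again just a sketch – S2V_TABLE here is a placeholder for however you store the sense2vec vectors):

from spacy.tokens import Token

S2V_TABLE = {}  # placeholder for your sense2vec vector table

def pos_vector_getter(token):
    # look up a "text|POS" key, e.g. "natural_language_processing|NOUN"
    return S2V_TABLE.get(token.text.replace(" ", "_") + "|" + token.pos_)

Token.set_extension("pos_vector", getter=pos_vector_getter)
# each token now exposes token._.pos_vector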

Finally, if you merge the tokens so that they naturally key into the vector table, you'll be able to train new Prodigy models with the sense2vec vectors. However, you won't be able to use the pre-trained parser, POS tagger etc models anymore if you do that, as you'll have created a very different view of the doc, and the previous features in the pre-trained model won't be valid.
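To illustrate what that merging could look like (just a sketch using the retokenizer – the actual sense2vec preprocessing does more, e.g. appending the POS or entity label to the merged text):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# merge noun chunks into single tokens so multi-word phrases can key
# directly into a phrase vector table – this is what breaks the
# assumptions of the pre-trained tagger/parser/NER mentioned above
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        retokenizer.merge(chunk)

print([token.text for token in doc])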


Hi, I hope you are doing well. I also want to train a custom sense2vec model on my corpus (a series of historical books), but I got a bit confused. Can you let me know how you did it? I have a custom NER model in spaCy made with Prodigy, and I have my data as JSONL.

- Can I use my pretrained NER model and build sense2vec on top of that?
- Should I convert my JSONL data?

I already answered your questions in the other thread:

You can train a custom sense2vec model, but it needs lots of text (likely more text than you said you had). Then you need to follow the steps described in the repository and the documentation, which tell you how to do it. Sorry, there's not much more we can help with – we've already open-sourced all our scripts and documented the steps.


OK, thank you for your response. I'll try more.

Thank you again for your help. I have worked on it and could make it work until the middle of step 03,

where I used

!python 03_glove_build_counts.py "C:/Users/moha/Documents/Models/glove.42B.300d.txt" "../data/output02" "../data/output03"  

Then I got this:

[i] Using 1 input files
[+] Created output directory ../data/output03
[i] Creating vocabulary counts
cat ..\data\output02\myOutFile.s2v | C:/Users/moha/Documents/Models/glove.42B.300d.txt/vocab_count -min-count 5 -verbose 2 > ..\data\output03\vocab.txt

[x] Failed creating vocab counts

'cat' is not recognized as an internal or external command,
operable program or batch file.

I think "cat" is a Unix command – does that mean script 03 does not work on Windows?

Does anyone know how I can make it work on Windows?

best

I have also tried "type" instead of "cat", and it does not work. I have also tried

fastText, but there it is written explicitly:

Generally, fastText builds on modern Mac OS and Linux distributions. Since
it uses some C++11 features, it requires a compiler with good C++11
support. These include :

So basically I need to build fastText, which, as mentioned on that page, is possible on Mac and Linux:

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
$ unzip v0.9.1.zip
$ cd fastText-0.9.1
$ make


 or

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

Do you know if there is a way to do it on Windows? If not, is it possible that someone could build it and then send me the binary file? (If that is at all reasonable – I don't really think it is. :))

best

Hi all,

Good news: I have also found an unofficial binary version of fastText following this:

Bad news: for some reason, I cannot run it using Visual C++. However, I have installed fastText in my Python environment.

Here is the latest update: I believe I now have fastText properly installed on my computer.

I have tried this, basically using the address as the path to the fastText binary:

!python 04_fasttext_train_vectors.py -c 10 "C:/Users/moha/Documents/Models/Debug/fasttext.dll" "../data/output02" "../data/output04"  

I then got this:

[i] Created temporary merged input file
..\data\output02\s2v_input.tmp
[i] Training vectors
C:/Users/moha/Documents/Models/Debug/fasttext.dll skipgram -thread 10 -input ..\data\output02\s2v_input.tmp -output ..\data\output04\vectors_w2v_300dim -dim 300 -minn 0 -maxn 0 -minCount 10 -verbose 2
[+] Deleted temporary input file
..\data\output02\s2v_input.tmp

[x] Failed training vectors

The system cannot execute the specified program.

I would be very thankful if you give me some hints

Best

@ines
Dear Ines

I would be very thankful for your help. I have tried everything I can, and I believe there must be a way for Windows users to use your nice scripts. As I explained above, I have done some of the steps but got stuck on fastText or GloVe (scripts 03–04). I think the only problem is that I need to find (or build) the binary files for those tools on Windows somehow. I explained what I have tried in my last reply.

However, I have trained vectors using plain Word2Vec and fastText from gensim (roughly along the lines of the sketch shown after the results below), and here are the results:

model_ft.wv.most_similar('imagination')

[('illumination', 0.982839822769165),
 ('translation', 0.9825044870376587),
 ('revolution', 0.9816499352455139),
 ('deviation', 0.9815695285797119),
 ('occultation', 0.9805629849433899),
 ('location', 0.9787797927856445),
 ('examination', 0.9787137508392334),
 ('alteration', 0.9734417796134949),
 ('fiction', 0.9707186222076416),
 ('discussion', 0.9705220460891724)]
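For reference, the gensim training was roughly along these lines – a minimal sketch, with parameter names assuming gensim 4.x (older versions use size/iter instead of vector_size/epochs), and corpus_sentences standing in for my tokenized books:

from gensim.models import FastText

corpus_sentences = [
    ["the", "power", "of", "imagination"],
    ["an", "examination", "of", "historical", "sources"],
]  # replace with an iterator over the real tokenized corpus

model_ft = FastText(sentences=corpus_sentences, vector_size=300,
                    window=5, min_count=1, epochs=5)
print(model_ft.wv.most_similar("imagination", topn=3))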

But since I see that you also use POS tags and NER in your scripts, it would be great if you could help me make that work.

Best

@jediwarpraptor
@justindujardin
@adriane

I would appreciate any kind of feedback.