Hi all,
I am trying to add fasttext word vectors to spaCy, so that I can save a model to use in a NER recipe. The embeddings are domain-specific (legislative text) and were trained on a very large corpus, so I thought they would help NER more than generic pretrained embeddings.
I browsed the forum and the documentation and I came up with the following steps:
- use my text corpus to train a fasttext model and get a bin fasttext file
- extract word vectors from the bin and save them in a txt format (each line contains a word followed by its vector, each value is space separated)
- add word vectors to spacy
python -m spacy init-model en /ft_vectors --vectors-loc EURLEX_ft_vectors.txt
- (not so sure about this, does it make sense?) pretrain the model on the raw corpus:
python -m spacy pretrain EURLEX_raw.jsonl ./ft_vectors ./pretrained_model --use-vectors
- train a NER model on the annotated data:
python -m prodigy train ner EURLEX_training_set ./ft_vectors --init-tok2vec ./pretrained_model/model499.bin --output ./model1
As of now I am stuck at step 3: I get the error below when running spacy init-model.
(My setup: spaCy 2.3.5, Prodigy 1.10, Python 3.8, Windows 10)
Traceback (most recent call last):
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Anaconda\envs\prodigy\lib\runpy.py", line 87, in run_code
    exec(code, run_globals)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy_main.py", line 33, in
    plac.call(commands[command], sys.argv[1:])
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 113, in init_model
    add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 207, in add_vectors
    vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors)
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in read_vectors
    shape = tuple(int(size) for size in next(f).split())
  File "C:\Users\Giovanni\AppData\Roaming\Python\Python38\site-packages\spacy\cli\init_model.py", line 228, in
    shape = tuple(int(size) for size in next(f).split())
ValueError: invalid literal for int() with base 10: 'the'
It appears that spaCy expects something different on the first line of the vectors file. Should I load the vectors in another format?
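Judging from the traceback, read_vectors parses the first line of the file as a tuple of integers giving the shape of the vector table, so my guess is that it wants a word2vec-style header line of the form "<number of rows> <number of dimensions>" before the vectors themselves. Here is a minimal sketch of adding such a header, assuming that guess is right. The function name add_shape_header and the output file name are hypothetical and not part of spaCy or fasttext.

```python
from pathlib import Path


def add_shape_header(src_path, dst_path, encoding="utf8"):
    """Copy a 'word v1 v2 ...' vectors file to dst_path, prepending a
    '<n_rows> <n_dims>' header line (assumed to be what init-model reads).

    Hypothetical helper; returns the (n_rows, n_dims) tuple it wrote.
    """
    src, dst = Path(src_path), Path(dst_path)
    n_rows, n_dims = 0, None

    # First pass: count the rows and check every row has the same width.
    with src.open("r", encoding=encoding) as f:
        for line in f:
            pieces = line.rstrip().split(" ")
            if len(pieces) < 2:
                continue  # skip blank lines or a word with no values
            dims = len(pieces) - 1  # first piece is the word itself
            if n_dims is None:
                n_dims = dims
            elif dims != n_dims:
                raise ValueError(
                    f"row {n_rows + 1} has {dims} values, expected {n_dims}"
                )
            n_rows += 1

    if n_dims is None:
        raise ValueError("no vector rows found in the source file")

    # Second pass: write the header, then the rows that were counted.
    with src.open("r", encoding=encoding) as f_in, dst.open(
        "w", encoding=encoding, newline="\n"
    ) as f_out:
        f_out.write(f"{n_rows} {n_dims}\n")
        for line in f_in:
            stripped = line.rstrip()
            if len(stripped.split(" ")) >= 2:
                f_out.write(stripped + "\n")

    return n_rows, n_dims
```

If the assumption holds, calling add_shape_header("EURLEX_ft_vectors.txt", "EURLEX_ft_vectors_header.txt") and passing the new file to --vectors-loc should get past the int() error, since the first line would then split into two integers.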
Any help on this is greatly appreciated!