I have just installed prodigy version 1.11 but I am having some trouble updating a notebook with a NER pipeline that I had prepared a few months ago (spacy 2.3.5, prodigy 1.10).
I have this txt file of domain specific word vectors, which I used to prepare a spacy model with this syntax:
!python -m spacy init-model en ./ft_vectors_model --vectors-loc EURLEX_ft_vectors.txt
now updated to:
!python -m spacy init vectors en ./EU_laws_FT_vectors.txt ./ft_vectors_model
What I want to do next is to train this model with a set of NER annotations, which is something that I managed to do in previous versions of Prodigy this way:
NER train
!python -m prodigy train ner --EU_laws_NER ./FT_vectors_model --init-tok2vec ./pretrained_FT_model/model99.bin --eval-split 0.2 --output ./FT_tok2vec_NER_model
Now, as far as I understand, in the "new" spaCy everything is handled via a config file, but I don't know whether this is the proper way to approach this task in Prodigy.
========================= Generating Prodigy config =========================
[i] Auto-generating config with spaCy
[i] Using config from base model
[+] Generated training config
=========================== Initializing pipeline ===========================
Config validation error
Bad value substitution: option 'width' in section 'components.ner.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
[2021-10-18 18:54:26,898] [INFO] Set up nlp object from config
Any advice on how to proceed?
**All suggestions are welcome!** Thanks in advance!
Thanks for the detailed report and sorry that you've been running into issues! I can reproduce this and am looking into a fix/workaround right now. I'll get back to you on this thread when I know more.
First some good news: we were able to locate the bug and will line up the fix for the next patch release.
To answer some of your other questions:
You're right that with spaCy v3, we started relying heavily on config files instead of controlling the training loop from code. While the switch may take a little getting used to, you'll notice that the config file actually gives you a lot more flexibility & control over the training loop. If you haven't already seen these, an introduction to the config system can be found in the spaCy docs or in this video (minutes 5 to 11 mainly).
We're still providing the command prodigy train for convenience purposes, and it generates a config file "on the fly". This has the disadvantage that you have less control over the config files and the error you originally got is meaningless to you, because you're not controlling the config generation.
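For what it's worth, the wording of that error matches what Python's standard-library `configparser` raises when a `${...}` reference can't be resolved. Here is a minimal sketch that reproduces the same class of error with plain `configparser`. Note that this is not spaCy's own config loader, and the one-section config is a made-up, stripped-down stand-in for the generated one:

```python
import configparser

# Hypothetical, stripped-down config: the ner section refers to a value
# under a tok2vec section that does not exist in this config.
CONFIG_TEXT = """
[components.ner.model.tok2vec]
width = ${components.tok2vec.model.encode.width}
"""

parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.read_string(CONFIG_TEXT)

try:
    # Interpolation happens on access, so the failure shows up here.
    parser.get("components.ner.model.tok2vec", "width")
    error_message = None
except configparser.InterpolationMissingOptionError as err:
    error_message = str(err)

print(error_message)
```

So in the auto-generated config, the `ner` component presumably refers to a setting that isn't present, which is hard to spot or fix when you never see the config file yourself.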
Instead, what I'd recommend is using the config file and spacy train directly instead of using prodigy train. The key to making this work is our new command data-to-spacy. This will generate both a config file and .spacy files with your annotation data in them. If you'd like, you can use only the data files and create your own config.
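As a rough sketch of that workflow from a notebook: the helper below only assembles the two command lines and prints them, so you can inspect them before running anything. The dataset name and model output path are taken from your post. The `./corpus` directory, the helper names, and the assumption that data-to-spacy writes `config.cfg`, `train.spacy` and `dev.spacy` into that directory are mine, so please double-check the flags against `--help` for your installed versions:

```python
import subprocess
import sys


def build_commands(dataset, corpus_dir, output_dir, eval_split=0.2):
    """Return two command lines: export the annotations, then train."""
    export_cmd = [
        sys.executable, "-m", "prodigy", "data-to-spacy", corpus_dir,
        "--ner", dataset,
        "--eval-split", str(eval_split),
    ]
    # Assumed file names inside corpus_dir (see the note above).
    train_cmd = [
        sys.executable, "-m", "spacy", "train", f"{corpus_dir}/config.cfg",
        "--paths.train", f"{corpus_dir}/train.spacy",
        "--paths.dev", f"{corpus_dir}/dev.spacy",
        "--output", output_dir,
    ]
    return export_cmd, train_cmd


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands("EU_laws_NER", "./corpus", "./FT_tok2vec_NER_model")
for cmd in commands:
    print(" ".join(cmd))

# run_all(commands)  # uncomment where Prodigy and spaCy are installed
```

If you write your own config instead of using the generated one, only the path passed to `spacy train` changes.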
About training and pretraining: this functionality has been improved in spaCy v3, and is indeed now also covered by the config file. The key is to use THE EXACT SAME config file for both the spacy pretrain and spacy train commands. More docs here.
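To make the "same config" point concrete, here is a small sketch that builds both calls from a single config path. It assumes the config contains a `[pretraining]` block and uses the `paths.raw_text` and `paths.init_tok2vec` settings that spaCy's generated configs normally define. The file and directory names are placeholders (only `model99.bin` comes from your original command), and the `--paths.train` / `--paths.dev` overrides are left out for brevity:

```python
import sys


def pretrain_and_train_commands(config_path, raw_text, pretrain_dir,
                                weights_file, output_dir):
    """Build `spacy pretrain` and `spacy train` calls sharing one config."""
    pretrain_cmd = [
        sys.executable, "-m", "spacy", "pretrain", config_path, pretrain_dir,
        "--paths.raw_text", raw_text,
    ]
    train_cmd = [
        sys.executable, "-m", "spacy", "train", config_path,
        # Point training at the weights written by the pretraining run.
        "--paths.init_tok2vec", f"{pretrain_dir}/{weights_file}",
        "--output", output_dir,
    ]
    return pretrain_cmd, train_cmd


pretrain_cmd, train_cmd = pretrain_and_train_commands(
    config_path="./config.cfg",          # placeholder
    raw_text="./raw_text.jsonl",         # placeholder
    pretrain_dir="./pretrained_FT_model",
    weights_file="model99.bin",
    output_dir="./FT_tok2vec_NER_model",
)
print(" ".join(pretrain_cmd))
print(" ".join(train_cmd))
```

Because both commands receive the same `config_path`, the tok2vec architecture that gets pretrained is the same one the training run later loads the weights into.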
Maybe it would make sense for you to have a look at the config files and all that, and data-to-spacy, and see how you go with the pretraining? Then ping me here if you run into specific issues!
Thanks a lot for your swift feedback and for the clarification.
I am switching to spacy train as soon as I can figure out how the new configuration system works =)
Moreover, the team that I am working with is very interested in the new 'relation extraction' functions, and as far as I understand this is not something that can be managed in the Prodigy framework (apart from the annotation, of course).
So thanks again for the time being (but I guess I will be back with issues and questions in the near future).