pretrained tok2vec weights - prodigy v 1.11

Hi all,

I have just installed prodigy version 1.11 but I am having some trouble updating a notebook with a NER pipeline that I had prepared a few months ago (spacy 2.3.5, prodigy 1.10).

I have this txt file of domain specific word vectors, which I used to prepare a spacy model with this syntax:

!python -m spacy init-model en ./ft_vectors_model --vectors-loc EURLEX_ft_vectors.txt

now updated to:

!python -m spacy init vectors en ./EU_laws_FT_vectors.txt ./ft_vectors_model

What I want to do next is train this model on a set of NER annotations, which in previous versions of Prodigy I managed to do this way:

!python -m prodigy train ner EU_laws_NER ./ft_vectors_model --eval-split 0.2 --output ./tmp_model_ft

I tried to update my syntax in this way:

!python -m prodigy train ./tmp_model_ft --ner EU_laws_NER --base-model ./ft_vectors_model --eval-split 0.2

but I get the following error:

========================= Generating Prodigy config =========================
[i] Auto-generating config with spaCy
[i] Using config from base model
[+] Generated training config

=========================== Initializing pipeline ===========================
Config validation error
Bad value substitution: option 'width' in section 'components.ner.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
[2021-10-18 18:54:26,898] [INFO] Set up nlp object from config

Any advice on how to proceed?

**All suggestions are welcome!**
Thanks in advance!

G

Hi,

Thanks for the detailed report and sorry that you've been running into issues! I can reproduce this and am looking into a fix/workaround right now. I'll get back to you on this thread when I know more.

Hi Sofie,
thanks a lot for looking into this, it is really appreciated.

I am also struggling with the next two steps, i.e.:

  1. pretraining the model on the raw corpus and
  2. using the weights of the tok2vec component while training a NER model on annotated data,

which I previously carried out like this:

  1. pretrain weights
    !python -m spacy pretrain EU_laws_raw.jsonl ./FT_vectors_model ./pretrained_FT_model --dropout 0.3 --batch-size 16 --n-iter 100 --use-vectors

  2. NER train
    !python -m prodigy train ner EU_laws_NER ./FT_vectors_model --init-tok2vec ./pretrained_FT_model/model99.bin --eval-split 0.2 --output ./FT_tok2vec_NER_model

Now, as far as I understand, in the "new" spacy everything is handled via a config file, but I don't know whether this is the proper way to approach this task in prodigy.

(i'm sorry if all this sounds very trivial =)

thanks for your help!
g

Hi again!

First some good news: we were able to locate the bug and will line up the fix for the next patch release.

To answer some of your other questions:

You're right that with spaCy v3, we started relying heavily on config files instead of controlling the training loop from code. While the switch may take a little getting used to, you'll notice that the config file actually gives you a lot more flexibility and control over the training loop. If you haven't already seen these, an introduction to the config system can be found in the spaCy docs or in this video (minutes 5 to 11 mainly).

We're still providing the prodigy train command for convenience, and it generates a config file "on the fly". The disadvantage is that you have less control over the config, and an error like the one you originally got is hard to interpret, because you're not the one controlling the config generation.

What I'd recommend instead is to work with a config file and call spacy train directly, rather than going through prodigy train. The key to making this work is our new data-to-spacy command, which generates both a config file and .spacy files containing your annotation data. If you'd like, you can use only the data files and create your own config.
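As a rough sketch of that workflow, using the dataset and vectors paths from earlier in this thread: the output folder `./corpus` and the file name `./config.cfg` are just placeholder names I've picked, and the `--paths.vectors` override assumes your config reads the vectors from `[paths] vectors`, which is what `spacy init config` produces when you optimize for accuracy. Treat it as a starting point and adjust it to your setup.

```shell
# 1. Export the Prodigy annotations as .spacy files, holding out 20% for evaluation.
#    "./corpus" is a placeholder output directory.
python -m prodigy data-to-spacy ./corpus --ner EU_laws_NER --eval-split 0.2 --lang en

# 2. Create your own config for an NER pipeline
#    (alternatively, start from the config that data-to-spacy wrote to ./corpus).
python -m spacy init config ./config.cfg --lang en --pipeline ner --optimize accuracy

# 3. Train with spaCy directly, pointing the config at the exported data
#    and at the model that holds your domain-specific vectors.
python -m spacy train ./config.cfg --output ./tmp_model_ft \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy \
    --paths.vectors ./ft_vectors_model
```

In a notebook you would prefix each command with `!`, as in your original cells.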

About training and pretraining: this functionality has been improved in spaCy v3, and is indeed now also covered by the config file. The key is to use THE EXACT SAME config file for both the spacy pretrain and spacy train commands. More docs here.
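A minimal sketch of what this could look like on the command line. It assumes you already have a base config at `./config.cfg` and exported training data in `./corpus` (both placeholder names), that the config exposes `[paths] vectors` and `[paths] init_tok2vec`, and that pretraining wrote a weights file called `model99.bin`. The actual file name depends on how many epochs ran, so check the output folder.

```shell
# 1. Add a [pretraining] block to the existing config.
python -m spacy init fill-config ./config.cfg ./config_pretrain.cfg --pretraining

# 2. Pretrain the tok2vec layer on the raw corpus.
python -m spacy pretrain ./config_pretrain.cfg ./pretrained_FT_model \
    --paths.raw_text ./EU_laws_raw.jsonl \
    --paths.vectors ./ft_vectors_model

# 3. Train the NER model with the same config, loading the pretrained weights.
#    Replace model99.bin with whichever weights file step 2 actually produced.
python -m spacy train ./config_pretrain.cfg --output ./FT_tok2vec_NER_model \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy \
    --paths.vectors ./ft_vectors_model \
    --paths.init_tok2vec ./pretrained_FT_model/model99.bin
```

Both commands read the same `config_pretrain.cfg`, so the tok2vec architecture that gets pretrained is the same one that loads the weights during training.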

Maybe it would make sense for you to have a look at the config files and all that, and data-to-spacy, and see how you go with the pretraining? Then ping me here if you run into specific issues!

Hi Sofie,

thanks a lot for your swift feedback and for the clarification.

i am switching to spacy train as soon as I can figure out how the new configuration system works =)

moreover, the team that I am working with is very interested in the new 'relation extraction' functions, and as far as I understand this is not something that can be managed in the prodigy framework (apart from the annotation of course).

so thanks again for the time being (but I guess I will be back with issues and questions in the near future).

all the best
g

Hi,

Happy to help!

> the team that I am working with is very interested in the new 'relation extraction' functions, and as far as I understand this is not something that can be managed in the prodigy framework (apart from the annotation of course).

That's right. The annotation functionality is described here. To learn these kinds of relations with spaCy, we do have a tutorial with a quick-and-dirty implementation that you could hopefully build upon: https://github.com/explosion/projects/tree/v3/tutorials/rel_component

If you run into any issues getting started with the config system or the REL project in spaCy, you can definitely also ask on our discussions forum (the explosion/spaCy Discussions board on GitHub).

And if you run into any other issues/questions with Prodigy, yes do feel free to open a new thread! :)