Prodigy: pretraining French tok2vec weights for NER

Hi,
I am reproducing "Training a Named Entity Recognition Model with Prodigy and Transfer Learning".
I am working with French. I ran ner.manual with my own labels without any issues.
Now I want to train the tok2vec layer for NER on my dataset, starting from pretrained French tok2vec weights.
I am missing information about how to pretrain these tok2vec weights.
Ines did this pretraining, which took ~8 hours on GPU, and the resulting weights can be found here: tok2vec_cd8_model289.bin
I would like to learn how to do the same thing for French.
Thanks in advance.

Hi! The pretrained tok2vec weights were created using the spacy pretrain command with a lot of raw text. You can find the details and documentation here:

The pretraining uses a language modelling objective, similar to how embeddings like BERT are trained. If you have a lot of raw text, this can be a good way to boost your accuracy.
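As a rough illustration of the "raw text" input: to my knowledge, spacy pretrain reads a JSONL file with one JSON object per line, each holding a "text" key. The sketch below prepares such a file using only the standard library. The helper name write_raw_text_jsonl, the file name raw_text_fr.jsonl and the sample sentences are all hypothetical, and a real pretraining run would need far more text than this.

```python
import json
import tempfile
from pathlib import Path


def write_raw_text_jsonl(texts, path):
    """Write one JSON object per line, each with a "text" key.

    Hypothetical helper: this is the JSONL layout assumed for the raw text
    passed to `spacy pretrain`. Returns the number of lines written.
    """
    path = Path(path)
    count = 0
    with path.open("w", encoding="utf8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                continue  # skip empty entries
            # ensure_ascii=False keeps accented French characters readable
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


# Tiny made-up sample, written to a temporary directory for the demo
sample_texts = [
    "Le chat dort sur le canapé.",
    "",
    "Paris est la capitale de la France.",
]
out_path = Path(tempfile.mkdtemp()) / "raw_text_fr.jsonl"
n_written = write_raw_text_jsonl(sample_texts, out_path)
print(n_written, "lines written to", out_path)
```

In spaCy v3, a file like this is normally referenced from the training config (for example through a paths.raw_text setting); check the pretrain documentation for the version you have installed.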

If you're using spaCy v3, you can also initialize your model with existing transformer embeddings, which will have a similar effect. You can use the quickstart widget (https://spacy.io/usage/training#quickstart) to generate a transformer-based config for French. To export your annotations for use with spaCy, you can use the data-to-spacy command. If you're using transformers, you should have a GPU available for training.
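The export and training steps above can be sketched as two command lines, assembled here in Python so they are easy to inspect before running. This assumes a Prodigy version with the spaCy v3 style data-to-spacy command (v1.11 or later) and a config.cfg saved from the quickstart widget. The dataset name fr_ner_dataset, the directory names and the helper functions are hypothetical, and the exact flags may differ in your installed versions.

```python
import shlex
import subprocess


def build_commands(dataset, corpus_dir="./corpus", config="config.cfg",
                   output_dir="./output", gpu_id=0):
    """Return the export and training commands as argument lists.

    Assumed flags: `prodigy data-to-spacy` with --ner, --lang and
    --eval-split, and `spacy train` with --paths overrides and --gpu-id.
    """
    export = [
        "prodigy", "data-to-spacy", corpus_dir,
        "--ner", dataset,
        "--lang", "fr",
        "--eval-split", "0.2",  # hold out 20% of the annotations for evaluation
    ]
    train = [
        "python", "-m", "spacy", "train", config,
        "--output", output_dir,
        "--paths.train", f"{corpus_dir}/train.spacy",
        "--paths.dev", f"{corpus_dir}/dev.spacy",
        "--gpu-id", str(gpu_id),
    ]
    return [export, train]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# "fr_ner_dataset" is a placeholder for your own Prodigy dataset name
cmds = build_commands("fr_ner_dataset")
for cmd in cmds:
    print(shlex.join(cmd))  # print only; call run_commands(cmds) to execute
```

Printing the commands first makes it easy to check the paths and copy them into a terminal. Nothing is executed unless run_commands is called.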

Hi Ines,
Thanks for your reply. I am using spaCy v3, so I will look into the quickstart you pointed me to.
Have a good day, and keep providing us with such a great library and useful tools!
