Hi,
I am working through the "Training a Named Entity Recognition Model with Prodigy and Transfer Learning" tutorial.
I am working on French. I ran ner.manual with my own labels without any issue.
Now I want to train the NER model on my dataset, initializing it with pretrained French tok2vec weights.
What I am missing is how to do this tok2vec pretraining myself.
In the tutorial, Ines did the pretraining (it took ~8 hours on a GPU), and the resulting file is available as tok2vec_cd8_model289.bin.
I would like to learn how to do the same thing for French.
Thanks in advance
Hi! The pretrained tok2vec weights were created using the spacy pretrain
command with a lot of raw text. You can find the details and documentation here:
- spaCy v2: https://v2.spacy.io/api/cli/#pretrain
- spaCy v3: https://spacy.io/api/cli/#pretrain
The pretraining uses a language modelling objective, similar to how embeddings like BERT are trained. If you have a lot of raw text, this can be a good way to boost your accuracy.
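For a concrete picture, here's a minimal sketch of what a spaCy v3 pretraining run could look like. The file names (raw_text.jsonl, the output directory) and the GPU ID are placeholder assumptions, not values from the tutorial:

```
# Generate a French NER config that includes a [pretraining] section
python -m spacy init config config.cfg --lang fr --pipeline ner --pretraining

# Run the pretraining on your raw text; raw_text.jsonl is a placeholder
# name for a file with one {"text": "..."} object per line
python -m spacy pretrain config.cfg ./pretrain_output --paths.raw_text raw_text.jsonl --gpu-id 0
```

The pretraining should write numbered model checkpoints (model0.bin, model1.bin, ...) to the output directory, and you can later point spacy train's init_tok2vec setting at one of those files. In spaCy v2, the rough equivalent was python -m spacy pretrain texts.jsonl fr_core_news_md ./output (see the v2 link above).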
If you're using spaCy v3, you can also initialize your model with existing transformer embeddings, which will have a similar effect. You can use the quickstart widget to generate a transformer-based config for French here: https://spacy.io/usage/training#quickstart
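As a quick sketch, assuming you saved the widget's output as base_config.cfg, you'd fill in the remaining defaults like this:

```
# Turn the partial quickstart config into a complete training config
python -m spacy init fill-config base_config.cfg config.cfg
```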
To export your annotations for use with spaCy, you can use the data-to-spacy command. If you're using transformers, you should have a GPU available for training.
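For illustration, here's a hedged sketch of the export-and-train flow. The dataset name my_french_ner and all paths are made-up placeholders, and the init_tok2vec override assumes your config defines that path (the default generated configs do):

```
# Export Prodigy annotations to spaCy's binary format (recent Prodigy
# versions also write a config and labels into the output directory)
prodigy data-to-spacy ./corpus --ner my_french_ner --lang fr --eval-split 0.2

# Train, optionally initializing the tok2vec layer with pretrained weights;
# model349.bin stands in for whichever pretraining checkpoint you pick
python -m spacy train ./corpus/config.cfg --output ./training \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy \
    --paths.init_tok2vec ./pretrain_output/model349.bin --gpu-id 0
```

If you go the transformer route instead, you'd skip init_tok2vec and just train the transformer-based config on GPU.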
Hi Ines,
Thanks for your reply! I am using spaCy v3, so I will look into the quickstart you pointed me to.
Have a good day, and keep providing us with such a great library and useful tools!