pretrain weights for transfer learning

I am new to Prodigy. I have an NER task for which I used ner.manual to create some initial labels. To test the approach, I want to use ner.train with pretrained weights. I tried the weights provided by SciBERT, and also some fine-tuned NER weights I have that were built on SciBERT. Both failed with the message:

srsly.msgpack.exceptions.ExtraData: unpack(b) received extra data.

So my question is: how can I use some of these pretrained weights with Prodigy? Since SciBERT is also compatible with scispaCy, I expected its weights to work.

How can I get pretrained weights for biomedical NER that work with Prodigy?

Thank you

Hi! The prodigy train command is a thin wrapper around spaCy and expects pretrained weights generated with spacy pretrain: https://spacy.io/api/cli/#pretrain

If you want to train a model using different weights, transformers etc., you can export your annotations using db-out and then train your model outside of Prodigy, using whichever library you want to use.
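As a rough sketch of that export route: db-out writes one JSON record per line, and for NER each record carries the text, a list of character-offset spans and the annotator's answer. The snippet below turns such lines into (text, entities) tuples that most training libraries can consume. The dataset name, the sample records and the helper name prodigy_to_examples are made up for illustration.

```python
import json


def prodigy_to_examples(jsonl_lines):
    """Convert Prodigy JSONL records into (text, {"entities": [...]}) tuples."""
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only the examples the annotator accepted.
        if record.get("answer") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in record.get("spans", [])
        ]
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hypothetical records, shaped like the output of:
#   prodigy db-out my_ner_dataset > annotations.jsonl
sample = [
    '{"text": "BRCA1 mutations raise cancer risk.", '
    '"spans": [{"start": 0, "end": 5, "label": "GENE"}], "answer": "accept"}',
    '{"text": "Skipped example.", "spans": [], "answer": "ignore"}',
]

examples = prodigy_to_examples(sample)
print(examples)
```

From there, the tuples can be fed to whichever training loop you prefer; rejected and ignored examples are dropped here, which may or may not be what you want.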

(The upcoming v3 of spaCy will feature support for custom pretrained embeddings and transformer-based pipelines out-of-the-box, so we'll probably be able to support this in Prodigy out-of-the-box as well.)

Thank you for the clarification. The idea is to train a model within Prodigy to speed up my annotation. I am generating data to train a model for custom NER labels. I will try to pretrain with a spaCy model.

However, I still want to clarify, as only 'en_vectors_web_lg' seems to work for Prodigy training with the example pretrained weights in your video.

Which other spaCy models could work? I tried 'en_trf_bertbaseuncased_lg', which is also a spaCy model, but it generated an error about a get method. Or is it that, to get suitable tok2vec weights, I should use this model as the input vectors model for spacy pretrain, alongside my data?

Yes, that's correct – spacy pretrain uses a language modelling objective to pretrain weights using word vectors. So you can use any model that has word vectors – en_vectors_web_lg is usually the best one.

The transformer models are just transformer weights packaged for spaCy, not word vectors. They're pretrained language models in themselves, so it wouldn't really make sense to use them with spacy pretrain.
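For the raw text side of spacy pretrain, here is a minimal sketch of preparing the input. It assumes spaCy v2, where the command reads a JSONL file with one {"text": ...} object per line; the sample texts, file name and output directory are placeholders.

```python
import json


def to_pretrain_jsonl(texts):
    """Serialize raw texts as JSONL, one {"text": ...} object per line."""
    lines = []
    for text in texts:
        # Collapse newlines and repeated whitespace so each text stays on one line.
        text = " ".join(text.split())
        if text:
            lines.append(json.dumps({"text": text}))
    return "\n".join(lines)


# Placeholder texts; in practice this would be a large unlabelled in-domain corpus.
raw_texts = [
    "Aspirin irreversibly inhibits\ncyclooxygenase.",
    "   ",
    "TP53 is frequently mutated in tumours.",
]

jsonl = to_pretrain_jsonl(raw_texts)
print(jsonl)

# To use it (assumption: spaCy v2 CLI, paths are placeholders):
#   1. write `jsonl` to raw_text.jsonl
#   2. python -m spacy pretrain raw_text.jsonl en_vectors_web_lg ./pretrain_output
```

The second argument of the command is the vectors model, so that is where a model with more relevant word vectors would go.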

Thank you, Ines, for this clarification. Is it also suitable for biomedical data? Are there any pretrained weights for biomedical data, like the ones in the food recipe, that can work with these vectors (or any others, for that matter)?

One further question, though: just to get the hang of things, I trained a model using en_vectors_web_lg (not sure whether it's suitable for biomedical data) and the pretrained weights shown in your food recipe video (which I think I shouldn't have used). The model trained and showed some promising figures. However, when I used the trained model in ner.correct, there was a message that the data would no longer be tokenized.
What could have caused this? I used en_core_sci_lg for the initial data annotation with ner.manual because it tokenizes biomedical data better.

Lastly, when I use ner.correct with --exclude, it does not seem to exclude my previously annotated data. It starts from the beginning of the file each time. Can you think of what I might be doing wrong?

Thanks

The en_vectors_web_lg vectors are repackaged GloVe vectors trained on the Common Crawl corpus. See here for details: https://nlp.stanford.edu/projects/glove/

They're likely not very useful for biomedical data because they were trained on general-purpose text. But you can always train your own vectors or use other existing vectors that were trained on more relevant data. Maybe just use the vectors from en_core_sci_lg? According to the scispaCy docs, the model ships with 600k vectors.

What was the exact message here? Maybe it was more related to tokenization mismatches? This can happen if your annotations were created with a different tokenizer than the one you want to use at runtime / during training. If that model produces different tokens, it may not be able to learn from your annotations, so Prodigy shows a warning.
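To check for this kind of mismatch yourself, one option is to test whether every annotated span starts and ends on a token boundary of the tokenizer you plan to train with. The sketch below uses a plain whitespace split as a stand-in tokenizer (an assumption for illustration, not what spaCy or scispaCy actually do), and the text and spans are made up.

```python
import re


def token_boundaries(text):
    """Stand-in tokenizer: whitespace split, returning sets of start and end offsets."""
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def misaligned_spans(text, spans):
    """Return the spans whose character offsets do not fall on token boundaries."""
    starts, ends = token_boundaries(text)
    return [
        span for span in spans
        if span["start"] not in starts or span["end"] not in ends
    ]


text = "IL-2-mediated activation of T-cells"
spans = [
    # "IL-2" ends in the middle of the whitespace token "IL-2-mediated".
    {"start": 0, "end": 4, "label": "PROTEIN"},
    # "T-cells" lines up with a whole whitespace token.
    {"start": 28, "end": 35, "label": "CELL"},
]

for span in misaligned_spans(text, spans):
    print("Misaligned:", text[span["start"]:span["end"]], span)
```

With a real pipeline you would replace token_boundaries with the offsets produced by the model's own tokenizer; a span flagged here is one the model could not represent as whole tokens.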

What are you setting the --exclude flag to? And is there any difference in the command you're running or anything else that could cause the new examples that are created to be different? For instance, different tokenization, sentence segmentation vs. no sentence segmentation etc.?
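To illustrate why that matters: as far as I understand it, excluding works by comparing hashes of the incoming examples against hashes stored with the existing annotations, so the same document split differently no longer matches. The sketch below mimics the idea with SHA-1 from the standard library, which is only a stand-in (Prodigy uses its own hashing), and the texts are invented.

```python
import hashlib


def input_hash(text):
    """Stand-in for an input hash. Assumption: Prodigy's real hashing differs."""
    return hashlib.sha1(text.encode("utf8")).hexdigest()


def filter_seen(texts, seen_hashes):
    """Drop incoming texts whose hash is already in the annotated set."""
    return [t for t in texts if input_hash(t) not in seen_hashes]


document = "Aspirin inhibits COX-1. It also reduces fever."

# Hashes of what was annotated earlier, here the whole document as one example.
annotated = {input_hash(document)}

# Same stream, same segmentation: the example is recognised and skipped.
print(filter_seen([document], annotated))

# Same stream, but now split into sentences: nothing matches, so everything
# is queued again from the beginning.
sentences = ["Aspirin inhibits COX-1.", "It also reduces fever."]
print(filter_seen(sentences, annotated))
```

So if one recipe run segments sentences and the other does not, previously annotated text can show up again even though the exclude setting is correct.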

This is the part where I wouldn't mind being guided on how to go about it. Should I just use en_core_sci_lg as is, or how do I invoke its vectors for this training? I tried using en_core_sci_lg, but finding the pretrained weights to pass in (something similar to tok2vec_cd8_model289.bin) is still a challenge. Or do I have to train a separate model and use its weights? I tried model weights from en_core_sci_lg, but that did not work.