I'm trying to train an NER model after manually annotating 2k chunks. I'm using spaCy v2.2 with Prodigy 1.10, so in this version the `prodigy train ner` command can only train spaCy pipelines (en_core_web_lg/sm). If I want to train BERT, I'd have to write a new tokenization script (or convert the Prodigy-produced JSONL to a BERT-compatible format) and use external tools like Hugging Face Transformers. Is there no BERT version of `prodigy train ner`?
Hi! The `prodigy train` command is designed for quick training experiments with spaCy, but you can always export your data and then train using a different library, e.g. PyTorch directly. There's no single "BERT-compatible format" – it really just depends on the model you want to train on top of the transformer weights and what it needs to predict. The JSONL export will give you the annotations, including the text and the spans, which you can then use to update your model.
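For example, you could export a dataset with `prodigy db-out` and convert the character-offset spans to token-level BIO tags, which most token classification setups expect. Here's a rough sketch – it uses naive whitespace tokenization just for illustration, and the file name is a placeholder. For an actual BERT model you'd tokenize with the model's own tokenizer and align the offsets accordingly:

```python
import json

def load_annotations(path):
    """Load NER annotations from a Prodigy JSONL export (prodigy db-out)."""
    examples = []
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer", "accept") != "accept":
                continue  # skip rejected/ignored examples
            spans = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
            examples.append((eg["text"], spans))
    return examples

def to_bio(text, spans):
    """Convert character-offset spans to token-level BIO tags.

    Whitespace tokenization is only for illustration – swap in your
    model's tokenizer and realign the offsets for real training data.
    """
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = pos = start + len(token)
        tag = "O"
        for s_start, s_end, label in spans:
            if s_start <= start and end <= s_end:
                tag = ("B-" if start == s_start else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

# e.g. tokens, tags = to_bio(*load_annotations("annotations.jsonl")[0])
```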
With spaCy v3 and the upcoming Prodigy (currently available as a nightly pre-release), you can also train spaCy pipelines initialised with transformer weights like BERT. That said, training with a transformer needs a GPU, so you typically want to export your annotations with `data-to-spacy` and then train with a transformer-based config on a separate GPU machine.
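That workflow could look roughly like this – the exact arguments depend on your Prodigy/spaCy versions, and `my_ner_data` is a placeholder dataset name:

```
# export annotations as a training corpus (Prodigy nightly / v1.11)
prodigy data-to-spacy ./corpus --ner my_ner_data --eval-split 0.2

# generate a transformer-based config (spaCy v3, needs spacy-transformers)
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu

# train on the GPU machine
python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```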
Running a quick experiment without transformers can still give you very useful insights, though, so it's often a good idea to try that first: if there's a problem with your data and your model isn't learning anything, you usually want to fix that first – even with better embeddings, your model will never be as good as it could be. If your model is doing well and you get good results, you know that initialising it with transformer embeddings will likely give you a good boost in accuracy.
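A quick baseline run in Prodigy v1.10 could look like this (again, `my_ner_data` is a placeholder):

```
prodigy train ner my_ner_data en_core_web_lg --output ./baseline_model --eval-split 0.2
```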
Thanks a lot, Ines!