Hi, I annotated my file with the command
prodigy bert.ner.manual data_5_trf ./input/data_5_ground_truth_1.0.jsonl --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F transformers_tokenizers.py
And I trained with this command:
prodigy train --ner data_5_trf ./tmp_model --eval-split 0.2 --config config.cfg --gpu-id 0 --label-stats
My question is: should I change some lines in the config.cfg file to match the
bert.ner.manual recipe? In particular, should I set the tokenizer vocab to './bert-base-uncased-vocab.txt'?
By the way, when I try to set vocab_data in config.cfg, I get an error:
vocab_data = './bert-base-uncased-vocab.txt'
Before diving deeper into this question I just want to make sure that I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:
New in Prodigy v1.11 and spaCy v3
spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the …
So just to check, are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Huggingface, that depends on the sentencepiece tokeniser ... then I can dive a bit deeper.
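To illustrate what "tokenization alignment" means here: subword pieces have to be mapped back to the linguistic token each one came from, so that span annotations made over words carry over to wordpieces. A toy sketch of that idea (the vocab, the greedy split, and the helper are mine for illustration; this is not the real BERT tokenizer):

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first wordpiece split; '##' marks continuation pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No piece matched: in real BERT the whole word becomes [UNK].
            return ["[UNK]"]
    return pieces

vocab = {"token", "##ization", "is", "fun"}
words = ["tokenization", "is", "fun"]

# Each wordpiece keeps the index of its source word, so word-level span
# annotations can be projected onto subwords and back.
alignment = [(piece, i) for i, w in enumerate(words)
             for piece in toy_wordpiece(w, vocab)]
print(alignment)
# [('token', 0), ('##ization', 0), ('is', 1), ('fun', 2)]
```

This alignment bookkeeping is exactly what spaCy v3 does for you automatically, which is why the custom recipe is only needed when another library's tokenizer must be matched at annotation time.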
Hey, thanks for your reply. I actually followed the tutorial here.
If you’re creating training data for fine-tuning a transformer, you can use its tokenizer to preprocess your texts to make sure that the data you annotate is compatible with the transformer tokenization. It also makes annotation faster, because your selection can snap to token boundaries. The following recipe implementation uses Hugging Face’s easy-to-use tokenizers library under the hood.
Given this example, what do you think should be changed in config.cfg?
As described here, you can load any Huggingface model you want in spaCy and have spaCy train a model using its features, but that's not the same thing as being able to fine-tune it. For that, you'll probably want to use the Huggingface library itself.
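On the vocab question specifically: in a spaCy v3 transformer config you don't point at a vocab .txt file at all, which is likely why setting vocab_data fails. The wordpiece vocab ships with the pretrained weights that the transformer component loads by name. A hedged sketch of what the relevant config section typically looks like (the exact architecture version and settings depend on your spacy-transformers release):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}
```

Here name is any model identifier from the Huggingface hub; the matching tokenizer and vocab are downloaded and loaded together with the weights.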
OK, but I am still confused. Do you know how to use Prodigy to train the model after annotating, following the Prodigy example here? What is the follow-up step after the BERT+NER annotation with Prodigy?
Just so I understand correctly: what is your goal?
Do you wish to train and update a Huggingface BERT model without spaCy? If so, you'll need to use that library to train the component, and you can use the data generated from this recipe. The extra effort of annotating with the custom recipe is needed here because Huggingface may use a different tokeniser than spaCy does.
If you wish to use BERT as part of a spaCy pipeline, then you can use the normal ner.manual recipe for annotation and you don't need to worry about the tokens. You can just use en_core_web_trf as the base model when running the train command from Prodigy. Assuming that you've annotated a dataset called annotated_ner, then your train command would look something like:
python -m prodigy train --ner annotated_ner --base-model en_core_web_trf
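If you go the Huggingface route instead, the annotation records produced by the bert.ner.manual recipe can be converted into per-token BIO labels, which is the format most Huggingface token-classification examples expect. A minimal sketch (the helper name is mine; the "tokens"/"spans" field layout follows Prodigy's JSONL output, where token_end is inclusive, and the labels come from your recipe command):

```python
def spans_to_bio(record):
    """Turn a Prodigy-style annotation record into per-token BIO labels."""
    labels = ["O"] * len(record["tokens"])
    for span in record.get("spans", []):
        start, end = span["token_start"], span["token_end"]  # end is inclusive
        labels[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + span["label"]
    return labels

# Invented example record mimicking one line of the exported JSONL.
record = {
    "text": "alice can read file1",
    "tokens": [
        {"text": "alice", "id": 0},
        {"text": "can", "id": 1},
        {"text": "read", "id": 2},
        {"text": "file1", "id": 3},
    ],
    "spans": [
        {"token_start": 2, "token_end": 2, "label": "ACCESSV"},
        {"token_start": 3, "token_end": 3, "label": "ACCESSN"},
    ],
}

print(spans_to_bio(record))
# ['O', 'O', 'B-ACCESSV', 'B-ACCESSN']
```

Because the recipe annotated over the BERT wordpiece tokens in the first place, these labels line up with the tokenizer output and can be fed to a Huggingface token-classification fine-tuning script without re-alignment.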