Hi, I annotated my file with this command:
prodigy bert.ner.manual data_5_trf ./input/data_5_ground_truth_1.0.jsonl --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F transformers_tokenizers.py
And I trained with this command:
prodigy train --ner data_5_trf ./tmp_model --eval-split 0.2 --config config.cfg --gpu-id 0 --label-stats
My question is: should I change some lines in the config.cfg file to match the
bert.ner.manual recipe? In particular, should I set the tokenizer vocab to './bert-base-uncased-vocab.txt'?
BTW, when I try to set vocab_data in config.cfg, I get an error:
vocab_data = './bert-base-uncased-vocab.txt'
Before diving deeper into this question I just want to make sure that I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:
New in Prodigy v1.11 and spaCy v3
spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use
data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run
train and provide the config via the […]
So just to check, are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Huggingface, that depends on the sentencepiece tokeniser ... then I can dive a bit deeper.
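If you do go the spaCy route, you normally don't point the config at a vocab .txt at all. As far as I know, the `vocab_data` setting expects JSONL lexeme data, which would explain the error you saw. A transformer-based config instead names the Hugging Face model, and the tokenizer and its vocab come along with it. A rough sketch of the relevant sections (not a full config; the exact architecture string depends on your spacy-transformers version):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"

[components.transformer.model.tokenizer_config]
use_fast = true
```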
Hey, thanks for your reply. I actually followed the tutorial here.
If you’re creating training data for fine-tuning a transformer, you can use its tokenizer to preprocess your texts to make sure that the data you annotate is compatible with the transformer tokenization. It also makes annotation faster, because your selection can snap to token boundaries. The following recipe implementation uses Hugging Face’s easy-to-use tokenizers library under the hood.
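As a toy illustration of what that subword tokenization looks like (the tiny vocab below is invented for the demo; a real BERT vocab has roughly 30,000 entries, and the recipe uses the tokenizers library rather than anything hand-rolled like this):

```python
# Toy WordPiece-style subword tokenizer with character offsets -- the same
# idea bert.ner.manual relies on so that annotated spans snap to boundaries
# the transformer can represent. TOY_VOCAB is made up for this demo.

TOY_VOCAB = {"access", "right", "##s", "grant", "##ed", "[UNK]"}

def wordpiece(word, start):
    """Greedy longest-match-first split of one word.
    Returns a list of (token, char_start, char_end) tuples."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            cand = word[i:j] if i == 0 else "##" + word[i:j]
            if cand in TOY_VOCAB:
                pieces.append((cand, start + i, start + j))
                i = j
                break
        else:
            # No piece matched: the whole word becomes [UNK]
            return [("[UNK]", start, start + len(word))]
    return pieces

def tokenize(text):
    tokens, pos = [], 0
    for word in text.split(" "):
        tokens.extend(wordpiece(word.lower(), pos))
        pos += len(word) + 1  # +1 for the space
    return tokens

print(tokenize("Access rights granted"))
# -> [('access', 0, 6), ('right', 7, 12), ('##s', 12, 13),
#     ('grant', 14, 19), ('##ed', 19, 21)]
```

Because each subword token carries its character offsets, a span you annotate in the UI can always be mapped back onto exact token boundaries.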
In this example, what do you think should be changed in config.cfg?
As described here, you can load any Huggingface model you want in spaCy and have spaCy train a model using its features, but that's not the same thing as being able to fine-tune it. For that, you'll probably want to use the Huggingface library itself.
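To sketch what that export step might involve: most Hugging Face token-classification fine-tuning expects token-level BIO tags, so the character-level spans Prodigy stores have to be aligned to the subword tokens. Here's a minimal, hypothetical converter; the hard-coded tokens stand in for the (text, start, end) offsets a real offset-aware BERT tokenizer would return, and the label is taken from your recipe's label set:

```python
# Sketch: turning character-level span annotations (as Prodigy stores them)
# into token-level BIO tags for fine-tuning with another library.
# The example tokens are stand-ins for real offset-aware tokenizer output.

def spans_to_bio(tokens, spans):
    """tokens: list of (text, char_start, char_end) subword tokens;
    spans: list of (char_start, char_end, label) annotations.
    Returns one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            # A token belongs to the span if it falls entirely inside it
            if s_start <= t_start and t_end <= s_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags

# "access rights" annotated as ACCESSN over characters 0-13
tokens = [("access", 0, 6), ("right", 7, 12), ("##s", 12, 13),
          ("grant", 14, 19), ("##ed", 19, 21)]
print(spans_to_bio(tokens, [(0, 13, "ACCESSN")]))
# -> ['B-ACCESSN', 'I-ACCESSN', 'I-ACCESSN', 'O', 'O']
```

Real pipelines usually also decide how to treat partially overlapping tokens and whether to label continuation pieces like "##s" at all; this sketch simply labels every token fully inside the span.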