config.cfg for bert.ner.manual

Hi, I annotated my file with the command

prodigy bert.ner.manual data_5_trf ./input/data_5_ground_truth_1.0.jsonl --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F

And I trained with this command

prodigy train --ner data_5_trf ./tmp_model --eval-split 0.2 --config config.cfg --gpu-id 0 --label-stats

My question is: Should I change some lines in the config.cfg file to match the bert.ner.manual recipe? In particular, should I set the tokenizer vocab to './bert-base-uncased-vocab.txt'?

BTW, when I try to set vocab_data in config.cfg, I get an error.

vocab_data = './bert-base-uncased-vocab.txt'

Before diving deeper into this question I just want to make sure that I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:

New in Prodigy v1.11 and spaCy v3

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.

So just to check, are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Huggingface, that depends on a subword tokeniser such as WordPiece ... then I can dive a bit deeper.

Hey, thanks for your reply. I actually followed the tutorial here.

If you’re creating training data for fine-tuning a transformer, you can use its tokenizer to preprocess your texts to make sure that the data you annotate is compatible with the transformer tokenization. It also makes annotation faster, because your selection can snap to token boundaries. The following recipe implementation uses Hugging Face’s easy-to-use tokenizers library under the hood.

In this example, what do you think should be changed in config.cfg?

As described here, you can load any Huggingface model you want in spaCy and have spaCy train a model using its features, but that's not the same thing as being able to fine-tune it. For that, you'll probably want to use the Huggingface library itself.

OK. But I am still confused. Do you know how to use Prodigy to train the model after annotating with the Prodigy example here? What is the follow-up step after the BERT+NER annotation with Prodigy?

For my understanding, what is your goal?

Do you wish to train and update a Huggingface BERT model without spaCy? If so, you'll need to use that library to train a component and you can use the data generated from this recipe. You'd need to take the extra effort here, because Huggingface might use a different tokeniser.
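To make that extra effort a bit more concrete, here is a minimal sketch of turning one annotated example into token-level BIO tags, which is the shape most token-classification training code expects. It assumes each exported task has a "tokens" list and a "spans" list whose "token_start" and "token_end" fields are inclusive token indices, as in Prodigy's usual span format. The function name spans_to_bio and the example task are made up for illustration, so check them against your own export (for instance the output of db-out) before relying on this.

```python
def spans_to_bio(task):
    """Convert one Prodigy-style task dict into (token texts, BIO tags).

    Assumption: span["token_start"] and span["token_end"] are inclusive
    indices into task["tokens"].
    """
    tokens = [tok["text"] for tok in task["tokens"]]
    tags = ["O"] * len(tokens)
    for span in task.get("spans", []):
        start, end = span["token_start"], span["token_end"]
        # First token of the span gets B-, the remaining ones get I-.
        tags[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + span["label"]
    return tokens, tags


# Hypothetical example task, hand-written to mirror the assumed format.
example = {
    "text": "grant access",
    "tokens": [
        {"text": "grant", "id": 0, "start": 0, "end": 5},
        {"text": "access", "id": 1, "start": 6, "end": 12},
    ],
    "spans": [
        {"start": 6, "end": 12, "token_start": 1, "token_end": 1, "label": "ACCESSN"}
    ],
}

tokens, tags = spans_to_bio(example)
print(list(zip(tokens, tags)))
```

Because the tokens in the data already come from the word-piece vocab used during annotation, the tags line up with the word pieces only if the model you fine-tune uses that same vocab and the same lowercasing setting.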

If you wish to use BERT as part of a spaCy pipeline, then you can use the normal ner.manual recipe for annotation and you don't need to worry about the tokens. You can just use en_core_web_trf as the base model when running the train command from Prodigy. Assuming that you've annotated a dataset called annotated_ner, your train command would look something like:

python -m prodigy train --ner annotated_ner --base-model en_core_web_trf