config.cfg for bert.ner.manual

Hi, I annotated my file with this command:

prodigy bert.ner.manual data_5_trf ./input/data_5_ground_truth_1.0.jsonl --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F

And I trained with this command:

prodigy train --ner data_5_trf ./tmp_model --eval-split 0.2 --config config.cfg --gpu-id 0 --label-stats

My question is: should I change some lines in the config.cfg file to match the bert.ner.manual recipe? In particular, should I set the tokenizer vocab to './bert-base-uncased-vocab.txt'?

BTW, when I try to set vocab_data in config.cfg, I get an error:

vocab_data = './bert-base-uncased-vocab.txt'

Before diving deeper into this question, I just want to make sure that I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:

New in Prodigy v1.11 and spaCy v3

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.

So just to check, are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Huggingface, that depends on the sentencepiece tokeniser ... then I can dive a bit deeper.

Hey, thanks for your reply. I actually followed the tutorial here.

If you’re creating training data for fine-tuning a transformer, you can use its tokenizer to preprocess your texts to make sure that the data you annotate is compatible with the transformer tokenization. It also makes annotation faster, because your selection can snap to token boundaries. The following recipe implementation uses Hugging Face’s easy-to-use tokenizers library under the hood.
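To make the "snap to token boundaries" idea concrete, here is a minimal pure-Python sketch of greedy WordPiece splitting with character offsets. It is not the recipe's actual implementation (that uses the Hugging Face tokenizers library). The vocabulary, the example sentence and the helper names `wordpiece`, `tokenize_with_offsets` and `snap_to_tokens` are all made up for illustration, and the code assumes whitespace-separated ASCII text.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, int, int]  # (piece, start_char, end_char)

# Toy vocabulary (an assumption for this sketch); a real run would read
# the entries from a file such as bert-base-uncased-vocab.txt.
TOY_VOCAB: Set[str] = {"[UNK]", "the", "user", "access", "##es", "file", "##s"}


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Split one lowercased word with greedy longest-match-first lookup."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize_with_offsets(text: str, vocab: Set[str]) -> List[Token]:
    """Return wordpieces together with their character offsets in `text`."""
    tokens: List[Token] = []
    pos = 0
    for word in text.split():
        begin = text.index(word, pos)
        pos = begin + len(word)
        pieces = wordpiece(word.lower(), vocab)
        if pieces == ["[UNK]"]:
            tokens.append(("[UNK]", begin, pos))
            continue
        offset = begin
        for i, piece in enumerate(pieces):
            # Only continuation pieces have the two-character "##" prefix.
            length = len(piece) - 2 if i > 0 else len(piece)
            tokens.append((piece, offset, offset + length))
            offset += length
    return tokens


def snap_to_tokens(tokens: List[Token], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Expand a character selection to cover every token it overlaps."""
    hit = [t for t in tokens if t[1] < end and t[2] > start]
    if not hit:
        return None
    return hit[0][1], hit[-1][2]


text = "The user accesses files"
tokens = tokenize_with_offsets(text, TOY_VOCAB)
snapped = snap_to_tokens(tokens, 11, 14)  # a sloppy selection inside "access"
print(tokens)
print(snapped, repr(text[snapped[0]:snapped[1]]))
```

The point of the sketch is only that annotated spans end up on wordpiece boundaries, so they line up with what a BERT-style tokenizer produces later on.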

In this case, what do you think should be changed in config.cfg?

As described here, you can load any Huggingface model you want in spaCy and have spaCy train a model using its features, but that's not the same thing as being able to fine-tune it. For that, you'll probably want to use the Huggingface library itself.