Hi, I annotated my file with the command
prodigy bert.ner.manual data_5_trf ./input/data_5_ground_truth_1.0.jsonl --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F transformers_tokenizers.py
And I trained with this command:
prodigy train --ner data_5_trf ./tmp_model --eval-split 0.2 --config config.cfg --gpu-id 0 --label-stats
My question is: should I change some lines in the config.cfg file to match the
bert.ner.manual recipe? In particular, should I set the tokenizer vocab to './bert-base-uncased-vocab.txt'?
By the way, when I try to set vocab_data in config.cfg, I get an error:
vocab_data = './bert-base-uncased-vocab.txt'
Before diving deeper into this question I just want to make sure that I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:
New in Prodigy v1.11 and spaCy v3
spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the …
So just to check, are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Huggingface, that depends on the sentencepiece tokeniser ... then I can dive a bit deeper.
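To illustrate what "tokenization alignment" means here: subword pieces have to be mapped back to the linguistic token each one came from, so that span annotations made over words carry over to wordpieces. A toy sketch of that idea (the vocab, the greedy split, and the helper are mine for illustration; this is not the real BERT tokenizer):

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first wordpiece split; '##' marks continuation pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No piece matched: in real BERT the whole word becomes [UNK].
            return ["[UNK]"]
    return pieces

vocab = {"token", "##ization", "is", "fun"}
words = ["tokenization", "is", "fun"]

# Each wordpiece keeps the index of its source word, so word-level span
# annotations can be projected onto subwords and back.
alignment = [(piece, i) for i, w in enumerate(words)
             for piece in toy_wordpiece(w, vocab)]
print(alignment)
# [('token', 0), ('##ization', 0), ('is', 1), ('fun', 2)]
```

This alignment bookkeeping is exactly what spaCy v3 does for you automatically, which is why the custom recipe is only needed when another library's tokenizer must be matched at annotation time.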
Hey, thanks for your reply. I actually followed the tutorial here.
If you’re creating training data for fine-tuning a transformer, you can use its tokenizer to preprocess your texts to make sure that the data you annotate is compatible with the transformer tokenization. It also makes annotation faster, because your selection can snap to token boundaries. The following recipe implementation uses Hugging Face’s easy-to-use tokenizers library under the hood.
Given this example, what do you think should be changed in config.cfg?
As described here, you can load any Huggingface model you want in spaCy and have spaCy train a model using its features, but that's not the same thing as being able to fine-tune it. For that, you'll probably want to use the Huggingface library itself.
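On the vocab question specifically: in a spaCy v3 transformer config you don't point at a vocab .txt file at all, which is likely why setting vocab_data fails. The wordpiece vocab ships with the pretrained weights that the transformer component loads by name. A hedged sketch of what the relevant config section typically looks like (the exact architecture version and settings depend on your spacy-transformers release):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "bert-base-uncased"
tokenizer_config = {"use_fast": true}
```

Here name is any model identifier from the Huggingface hub; the matching tokenizer and vocab are downloaded and loaded together with the weights.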
OK, but I am still confused. Do you know how to use Prodigy to train the model after annotating, following the Prodigy example here? What is the follow-up step after the BERT+NER annotation with Prodigy?
Just so I understand correctly: what is your goal?
Do you wish to train and update a Huggingface BERT model without spaCy? If so, you'll need to use that library to train the component, and you can use the data generated from this recipe. The extra effort of annotating with the custom recipe is needed here because Huggingface may use a different tokeniser than spaCy does.
If you wish to use BERT as part of a spaCy pipeline, then you can use the normal ner.manual recipe for annotation and you don't need to worry about the tokens. You can just use en_core_web_trf as the base model when running the train command from Prodigy. Assuming that you've annotated a dataset called annotated_ner, then your train command would look something like:
python -m prodigy train --ner annotated_ner --base-model en_core_web_trf
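If you go the Huggingface route instead, the annotation records produced by the bert.ner.manual recipe can be converted into per-token BIO labels, which is the format most Huggingface token-classification examples expect. A minimal sketch (the helper name is mine; the "tokens"/"spans" field layout follows Prodigy's JSONL output, where token_end is inclusive, and the labels come from your recipe command):

```python
def spans_to_bio(record):
    """Turn a Prodigy-style annotation record into per-token BIO labels."""
    labels = ["O"] * len(record["tokens"])
    for span in record.get("spans", []):
        start, end = span["token_start"], span["token_end"]  # end is inclusive
        labels[start] = "B-" + span["label"]
        for i in range(start + 1, end + 1):
            labels[i] = "I-" + span["label"]
    return labels

# Invented example record mimicking one line of the exported JSONL.
record = {
    "text": "alice can read file1",
    "tokens": [
        {"text": "alice", "id": 0},
        {"text": "can", "id": 1},
        {"text": "read", "id": 2},
        {"text": "file1", "id": 3},
    ],
    "spans": [
        {"token_start": 2, "token_end": 2, "label": "ACCESSV"},
        {"token_start": 3, "token_end": 3, "label": "ACCESSN"},
    ],
}

print(spans_to_bio(record))
# ['O', 'O', 'B-ACCESSV', 'B-ACCESSN']
```

Because the recipe annotated over the BERT wordpiece tokens in the first place, these labels line up with the tokenizer output and can be fed to a Huggingface token-classification fine-tuning script without re-alignment.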