Efficient data for transformers like BERT

How can I change data already annotated with Prodigy into efficient data for transformers like BERT?

Have you seen our docs section on using Transformers in Prodigy?

Yes. Where do I find /bert-base-uncased-vocab.txt?

It's on HF's website: bert-base-uncased-vocab.txt
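
If it's easier, you can also write the file out locally with the transformers library. A minimal sketch, assuming you have transformers installed (save_vocabulary writes the WordPiece vocab to disk as vocab.txt):

from transformers import BertTokenizer

# download the bert-base-uncased tokenizer and write its WordPiece
# vocab into the current directory as vocab.txt
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.save_vocabulary("."))  # prints the path(s) of the saved file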

Good point - we should add this to the docs. Thank you!

I keep getting this error: "sep_token not found in the vocabulary"

hm... I just tried on a random dataset and didn't have any problems.

food_data.jsonl (2.3 MB)

python -m prodigy bert.ner.manual ner_food food_data.jsonl --label INGRED --tokenizer-vocab bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix -F transformers_tokenizers.py

So the question is: do I need to annotate again, or will this turn my already annotated data into efficient annotations for the transformers? Also, I want to use roberta-base, so is there a vocab.txt for it?

Using 8 label(s): College_Name, Companies_Worked_At, Degree, Languages, Email_Address, Name, Skills, Years_of_Experience
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.10/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.10/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/content/transformers_tokenizers.py", line 47, in ner_manual_tokenizers_bert
    tokenizer = BertWordPieceTokenizer(tokenizer_vocab, lowercase=lowercase)
  File "/usr/local/lib/python3.10/dist-packages/tokenizers/implementations/bert_wordpiece.py", line 57, in __init__
    raise TypeError("sep_token not found in the vocabulary")
TypeError: sep_token not found in the vocabulary

Hi @YassineSboui,

Were you able to replicate my example?

I think you may have accidentally saved your vocab.txt file incorrectly. The error is telling you what the problem is: it isn't finding the [SEP] token in your vocab.txt. If you're not able to recreate my example using your saved vocab.txt file, then that's most likely the problem.
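
A quick way to check is to look for the special tokens directly. A minimal sketch, assuming the file sits in your working directory:

# each special token must appear as its own line in a WordPiece vocab file
with open("bert-base-uncased-vocab.txt", encoding="utf-8") as f:
    vocab = set(line.strip() for line in f)

for token in ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"):
    print(token, token in vocab)

If [SEP] prints False here, the file got mangled when you saved it.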

I'm not really sure what "annotate again" means. Did you annotate once without transformers, and now you want to start annotating with transformers? Be aware that all of your annotations need to use the same tokenizer, or else you'll run into a tokenization mismatch. See the docs or other support issues for more details.
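
To see why the mismatch matters, here's a small sketch (assuming the tokenizers package and your saved vocab file): the same text produces different tokens, and therefore different span boundaries, under whitespace splitting vs. WordPiece.

from tokenizers import BertWordPieceTokenizer

text = "Add two cups of buttermilk"
tok = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

# whitespace tokens vs. WordPiece tokens; WordPiece may split words
# into sub-pieces prefixed with '##', shifting the span boundaries
print(text.split())
print(tok.encode(text).tokens)

Spans annotated on one tokenization won't line up with the other, which is why mixing the two in one dataset causes problems.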

Also, I'm not sure about roberta-base as it's a third-party tool. I did a quick Google search and it seems like it uses vocab.json and merges.txt files for its tokenizer.
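
For example, you could load it like this (a sketch, assuming you've downloaded the two files from HF and have the tokenizers package installed):

from tokenizers import ByteLevelBPETokenizer

# roberta-base ships a byte-level BPE tokenizer that loads from
# vocab.json + merges.txt rather than a single vocab.txt
tok = ByteLevelBPETokenizer("vocab.json", "merges.txt")
print(tok.encode("Add two cups of buttermilk").tokens)

So the bert.ner.manual recipe's --tokenizer-vocab argument, which expects a single WordPiece vocab.txt, wouldn't apply to roberta-base as-is; you'd need to adapt the recipe to a BPE tokenizer like the one above.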