prodigy-hf KeyError: 'tokens' when training pre-existing ner dataset

mv3 · December 17, 2023, 7:34am

Hi, I tried out prodigy-hf:

python -m prodigy hf.train.ner \
ner_citeables \
output/ner_citerables_hf \
--epochs 10 \
--model-name distilbert-base-uncased

Getting the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 182, in hf_train_ner
    gen_train, gen_valid, label_list, id2lab, lab2id = into_hf_format(train_examples, valid_examples)
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 66, in into_hf_format
    train_out = list(generator(train_examples))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 53, in generator
    tokens = [tok['text'] for tok in ex["tokens"]]
                                     ~~^^^^^^^^^^
KeyError: 'tokens'
/Users/mv/.pyenv/versions/3.11.6/lib/python3.11/tempfile.py:895: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/var/folders/7g/qc77bj6j63q0gr0mrzvbmxkc0000gn/T/tmpecx3q6kl'>
  _warnings.warn(warn_message, ResourceWarning)

The pre-existing NER dataset was annotated using the standard spaCy pipeline of [tok2vec, ner]. Is prodigy-hf usable only on datasets previously annotated with a transformer model, e.g. via prodigy bert.ner.manual?

I'd appreciate advice on the issue, or a pointer if I'm doing something wrong. Running on:

macOS
The error appears on both Python 3.11.6 and 3.11.7
spaCy 3.7.2, Prodigy 1.14.12

magdaaniol · December 21, 2023, 7:55pm

Hi @mv3,

Since the recipe takes care of converting the data into the format Hugging Face expects, any Prodigy NER dataset should work with it. Tokenization is required, though. Could you double-check whether the examples in the dataset you're using as input contain a "tokens" key? If not, you'll need to modify the recipe to add tokenization. There's a helper function for this.
If you do have tokens in all your examples, could you share one example? Thanks!
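
For reference, the check and the fix described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the examples are plain dicts (in practice they would presumably be loaded from the Prodigy database for ner_citeables), the function names find_missing_tokens, simple_tokens and add_missing_tokens are hypothetical, and the regex tokenizer is a stand-in for a real spaCy tokenizer. In a real recipe, Prodigy's own tokenization helper with a spaCy pipeline would be the better choice. The sketch also assumes the recipe reads "token_start" and "token_end" from each span, which the traceback does not confirm.

```python
import re

def find_missing_tokens(examples):
    """Return the indices of examples that lack a non-empty "tokens" list."""
    return [i for i, eg in enumerate(examples) if not eg.get("tokens")]

def simple_tokens(text):
    """Stand-in tokenizer (NOT spaCy): words and single punctuation marks.
    Produces dicts shaped like Prodigy token entries."""
    tokens = []
    for i, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        end = match.end()
        tokens.append({
            "text": match.group(),
            "start": match.start(),
            "end": end,
            "id": i,
            # True if the token is followed by whitespace
            "ws": end < len(text) and text[end].isspace(),
        })
    return tokens

def add_missing_tokens(examples):
    """Return copies of the examples with "tokens" added where missing.
    Spans on those examples get token_start/token_end (inclusive) when
    their character offsets line up with token boundaries (assumption:
    the training recipe needs these keys)."""
    fixed = []
    for eg in examples:
        eg = dict(eg)  # shallow copy so the input is left untouched
        if not eg.get("tokens"):
            eg["tokens"] = simple_tokens(eg["text"])
            starts = {t["start"]: t["id"] for t in eg["tokens"]}
            ends = {t["end"]: t["id"] for t in eg["tokens"]}
            if "spans" in eg:
                spans = []
                for span in eg["spans"]:
                    span = dict(span)
                    if span["start"] in starts and span["end"] in ends:
                        span["token_start"] = starts[span["start"]]
                        span["token_end"] = ends[span["end"]]
                    spans.append(span)
                eg["spans"] = spans
        fixed.append(eg)
    return fixed

# Made-up sample data for illustration only
examples = [
    {"text": "See Smith v. Jones.", "answer": "accept",
     "spans": [{"start": 4, "end": 18, "label": "CASE"}]},
    {"text": "Hi", "answer": "accept",
     "tokens": [{"text": "Hi", "start": 0, "end": 2, "id": 0, "ws": False}]},
]

missing = find_missing_tokens(examples)
fixed = add_missing_tokens(examples)
print("examples without tokens:", missing)
print("still missing after fix:", find_missing_tokens(fixed))
```

If the first print lists any indices for your dataset, those are the examples that would trigger the KeyError: 'tokens' in the traceback above.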
