prodigy-hf KeyError: 'tokens' when training pre-existing ner dataset

Hi, I tried out prodigy-hf:

python -m prodigy hf.train.ner \
ner_citeables \
output/ner_citerables_hf \
--epochs 10 \
--model-name distilbert-base-uncased

Getting the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/", line 50, in <module>
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/", line 44, in main
    controller = run_recipe(run_args)
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/", line 182, in hf_train_ner
    gen_train, gen_valid, label_list, id2lab, lab2id = into_hf_format(train_examples, valid_examples)
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/", line 66, in into_hf_format
    train_out = list(generator(train_examples))
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/", line 53, in generator
    tokens = [tok['text'] for tok in ex["tokens"]]
KeyError: 'tokens'
/Users/mv/.pyenv/versions/3.11.6/lib/python3.11/ ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/var/folders/7g/qc77bj6j63q0gr0mrzvbmxkc0000gn/T/tmpecx3q6kl'>
  _warnings.warn(warn_message, ResourceWarning)

The pre-existing NER dataset was annotated with the standard spaCy pipeline of [tok2vec, ner]. Is prodigy-hf only usable on datasets previously annotated with a transformer model, e.g. via prodigy bert.ner.manual?
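For context on the failing line (tokens = [tok['text'] for tok in ex["tokens"]]): the recipe expects each saved example to carry Prodigy-style token objects alongside its spans. A hypothetical minimal example of that shape (text and labels invented for illustration):

```python
# Sketch of the example shape the recipe can consume: a "tokens" list
# of dicts with at least a "text" field, plus the annotated "spans".
example_with_tokens = {
    "text": "Acme Corp filed suit.",
    "tokens": [
        {"text": "Acme", "start": 0, "end": 4, "id": 0},
        {"text": "Corp", "start": 5, "end": 9, "id": 1},
        {"text": "filed", "start": 10, "end": 15, "id": 2},
        {"text": "suit", "start": 16, "end": 20, "id": 3},
        {"text": ".", "start": 20, "end": 21, "id": 4},
    ],
    "spans": [
        {"start": 0, "end": 9, "label": "ORG", "token_start": 0, "token_end": 1}
    ],
}

# The line from the traceback only works when "tokens" is present:
tokens = [tok["text"] for tok in example_with_tokens["tokens"]]
print(tokens)  # ['Acme', 'Corp', 'filed', 'suit', '.']
```

An example saved without explicit tokenization has no "tokens" key at all, which would raise exactly this KeyError.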

Hoping for advice on the issue, or a pointer if I'm doing something wrong. Environment:

  1. macOS
  2. The error appears on both Python 3.11.6 and 3.11.7
  3. spaCy 3.7.2, Prodigy 1.14.12

Hi @mv3,

Since the recipe takes care of transforming the data into the format expected by Hugging Face, any Prodigy NER dataset should be fine to use with it. That said, tokenization is required. Could you double-check whether the examples in your input dataset contain a "tokens" key? If not, you could modify the recipe to add tokenization; there's a helper function for this.
If you do have tokens in all your examples, could you share one example? Thanks!