Hi, tried out prodigy-hf:
python -m prodigy hf.train.ner \
    ner_citeables \
    output/ner_citerables_hf \
    --epochs 10 \
    --model-name distilbert-base-uncased
Getting the following error:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/__main__.py", line 50, in <module>
    main()
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy/__main__.py", line 44, in main
    controller = run_recipe(run_args)
                 ^^^^^^^^^^^^^^^^^^^^
  File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 182, in hf_train_ner
    gen_train, gen_valid, label_list, id2lab, lab2id = into_hf_format(train_examples, valid_examples)
                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 66, in into_hf_format
    train_out = list(generator(train_examples))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mv/Code/start-prodigy/.venv/lib/python3.11/site-packages/prodigy_hf/ner.py", line 53, in generator
    tokens = [tok['text'] for tok in ex["tokens"]]
                                     ~~^^^^^^^^^^
KeyError: 'tokens'
/Users/mv/.pyenv/versions/3.11.6/lib/python3.11/tempfile.py:895: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/var/folders/7g/qc77bj6j63q0gr0mrzvbmxkc0000gn/T/tmpecx3q6kl'>
  _warnings.warn(warn_message, ResourceWarning)
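The failing line reads ex["tokens"], so at least one example in the dataset seems to have no "tokens" key. A rough sketch of how that could be checked, assuming the dataset has first been exported with prodigy db-out to a JSONL file; the helper names and the sample examples below are made up for illustration and are not part of prodigy-hf:

```python
import json
from typing import Iterable, List


def load_jsonl(path: str) -> List[dict]:
    """Read one JSON object per non-empty line of a JSONL export."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def missing_tokens(examples: Iterable[dict]) -> List[int]:
    """Return the positions of examples that have no "tokens" key."""
    return [i for i, ex in enumerate(examples) if "tokens" not in ex]


# Made-up examples standing in for the exported dataset (assumption).
sample = [
    # Spans only, no "tokens": this is the shape that would raise the KeyError.
    {"text": "See Smith v. Jones.", "spans": [{"start": 4, "end": 18, "label": "CITE"}]},
    # Same kind of example, but with token information attached.
    {
        "text": "Hello",
        "tokens": [{"text": "Hello", "start": 0, "end": 5, "id": 0, "ws": False}],
        "spans": [],
    },
]

print(missing_tokens(sample))  # [0]
```

With a real export, missing_tokens(load_jsonl(path)) would list the affected examples, where path is wherever the db-out output was saved.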
The pre-existing NER dataset was based on the standard spaCy pipeline of [tok2vec, ner]. Is prodigy-hf usable only on datasets previously annotated with a transformer model, e.g. prodigy bert.ner.manual?
Hoping for advice on the issue, or on whether I'm doing something wrong. Running on:
- macOS
- Both errors appear on Python 3.11.6 and 3.11.7
- spaCy 3.7.2, Prodigy 1.14.12