Hi,
I'm running into an error when evaluating a text classification model on a separate dataset.
I have four files in JSONL format containing:
- samples of class1 {"id": "unique", "text": "xyz", "label": "class1"}
- samples of class2 {"id": "unique", "text": "xyz", "label": "class2"}
- evaluation samples of class1 {"id": "unique", "text": "xyz", "label": "class1"}
- evaluation samples of class2 {"id": "unique", "text": "xyz", "label": "class2"}
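In case the data itself is relevant, here is a minimal sketch of how records in this shape could be sanity-checked before importing them. The function name `check_jsonl_lines` and the sample lines are hypothetical, and it only assumes the three keys shown above:

```python
import json
from collections import Counter

# Keys taken from the sample records above.
REQUIRED_KEYS = {"id", "text", "label"}


def check_jsonl_lines(lines):
    """Validate JSONL records and return a Counter of their labels."""
    label_counts = Counter()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(
                f"line {line_number}: missing keys {sorted(missing)}"
            )
        label_counts[record["label"]] += 1
    return label_counts


# Hypothetical in-memory sample; for a real file, pass an open file handle.
sample_lines = [
    '{"id": "a1", "text": "xyz", "label": "class1"}',
    '{"id": "a2", "text": "xyz", "label": "class2"}',
]
print(check_jsonl_lines(sample_lines))
```

Since an open file iterates line by line, something like `check_jsonl_lines(open(path, encoding="utf8"))` should work for the real files.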
I created a Prodigy dataset for training and imported the first two files with db-in. I then trained a model, using the default 20% held-out split for evaluation:
prodigy train textcat train_dataset en_core_web_sm
This worked correctly.
I then created a Prodigy dataset for evaluation and imported the last two files with db-in. To check that this data was usable, I trained a model on it as well:
prodigy train textcat eval_dataset en_core_web_sm
This also worked as expected.
However, when I try to train on the train_dataset and evaluate on the eval_dataset:
prodigy train textcat train_dataset en_core_web_sm --eval-id eval_dataset
... I get the following error:
✔ Loaded model 'en_core_web_sm'
Created and merged data for 3665 total examples
Created and merged data for 2539 total examples
Using 3665 train / 2539 eval (from 'eval_dataset')
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/plac-0.9.6-py3.6.egg/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/plac-0.9.6-py3.6.egg/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/prodigy/recipes/train.py", line 147, in train
    baseline = nlp.evaluate(eval_data)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/spacy/language.py", line 691, in evaluate
    scorer.score(doc, gold, **kwargs)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/spacy/scorer.py", line 239, in score
    gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
TypeError: 'NoneType' object is not iterable
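Reading the last frame, my guess is that gold.orig_annot is None for my textcat-only examples at the point where the scorer collects gold entities during the baseline evaluation. That is an assumption based on the traceback alone; I have not confirmed it in the spaCy source. The sketch below uses a hypothetical stand-in class (not spaCy's real gold object) and only shows that iterating over None in a comprehension of the same shape produces the same message:

```python
class FakeGold:
    """Hypothetical stand-in for a gold object without token-level annotations."""

    orig_annot = None


def collect_last_fields(gold):
    # Same shape as the comprehension in the last traceback frame.
    return [annot[-1] for annot in gold.orig_annot]


try:
    collect_last_fields(FakeGold())
except TypeError as err:
    message = str(err)
    print(message)
```

If that guess is right, the data itself may be fine and the problem would be in how the evaluation examples are scored against the base model.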
Appreciate any insights that might help.
Best,
James