v1.9.7 train with --eval-id gives error

Hi,

I am running into an issue with evaluation on a separate dataset for a text classification task.

I have 4 files in jsonl format that represent:

  • samples of class1 {"id": "unique", "text": "xyz", "label": "class1"}
  • samples of class2 {"id": "unique", "text": "xyz", "label": "class2"}
  • evaluation samples of class1 {"id": "unique", "text": "xyz", "label": "class1"}
  • evaluation samples of class2 {"id": "unique", "text": "xyz", "label": "class2"}

I created a Prodigy dataset for training and used db-in to import the first two files. I then trained a model with the default 20% held-out set for evaluation:

prodigy train textcat train_dataset en_core_web_sm

This worked correctly.
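
(For reference, the import step looked roughly like this – the file names are just placeholders for my actual JSONL files:)

prodigy db-in train_dataset class1_samples.jsonl
prodigy db-in train_dataset class2_samples.jsonl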

I then created a Prodigy dataset for evaluation and used db-in to import the remaining two files. To check that everything was working, I trained a model on this dataset as well:

prodigy train textcat eval_dataset en_core_web_sm

This also worked as expected.

However, when I try to train on the train_dataset and evaluate on the eval_dataset:

prodigy train textcat train_dataset en_core_web_sm --eval-id eval_dataset

... I get the following error:

✔ Loaded model 'en_core_web_sm'
Created and merged data for 3665 total examples
Created and merged data for 2539 total examples
Using 3665 train / 2539 eval (from 'eval_dataset')
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/plac-0.9.6-py3.6.egg/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/plac-0.9.6-py3.6.egg/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/prodigy/recipes/train.py", line 147, in train
    baseline = nlp.evaluate(eval_data)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/spacy/language.py", line 691, in evaluate
    scorer.score(doc, gold, **kwargs)
  File "/mnt/c/Users/james/venv/ubusci/lib/python3.6/site-packages/spacy/scorer.py", line 239, in score
    gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot]))
TypeError: 'NoneType' object is not iterable

Appreciate any insights that might help.

Best,
James

Hi! Thanks for the report, that's definitely strange :thinking: It looks like there might be a value somewhere in the data that ends up as None – something that possibly gets ignored if you train from the examples directly, but trips up the evaluation logic. If you look at the data in your eval_dataset, is there anything in there that looks suspicious?
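
One quick way to check is to export the dataset and skim the raw JSONL, e.g.:

prodigy db-out eval_dataset > eval_dataset.jsonl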

Also, it might not matter, but when you get a chance, could you try upgrading to the latest v1.9.9? I'd have to double-check, but it's possible that the underlying problem has already been fixed.

Thanks!

OK, I apparently had some entries whose text was "". Is this really an "error"? In spaCy I can create a perfectly valid (albeit useless) empty Doc object. Perhaps the db-in command could skip such entries and report the number of non-empty documents it imported?

For more context: the reason this happens for me is that I am doing some specific preprocessing that can, in rare cases, leave no text in the input document. I can check for this myself, obviously, but I didn't find the error very transparent...
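
For what it's worth, filtering these out before db-in is simple enough – something along these lines (the file names here are just placeholders):

import json

# keep only examples whose "text" is non-empty after stripping whitespace
with open("eval_class1.jsonl") as f_in, open("eval_class1_clean.jsonl", "w") as f_out:
    for line in f_in:
        example = json.loads(line)
        if example.get("text", "").strip():
            f_out.write(line)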

Thanks for checking!

An empty string is definitely valid! spaCy shouldn't choke on it, and Prodigy shouldn't make any assumptions about what it means, either.

Maybe the problem here is caused by the preprocessing or data merging logic – perhaps there's an incorrect None check somewhere that's supposed to catch missing values, but instead just checks whether a value is falsy. I'll try to track this down :slightly_smiling_face:
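
Purely as an illustration of the kind of thing I mean (this isn't the actual Prodigy source, just the pattern that could cause it):

text = ""  # empty text imported via db-in

# a falsy check treats the empty string the same as a missing value ...
if not text:
    annotations = None  # ... and later code then tries to iterate over None

# ... whereas an explicit check would keep "" as a valid (if useless) example
if text is None:
    annotations = None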