train ner dataset -> ValueError: too many values to unpack

Hi all,
I have been working with spaCy for about 3 months and am brand-new to Prodigy. I used a small set of short texts in JSONL and ran ner.teach to collect binary annotations for the PERSON label (only). After 77 annotations, the web app said there were no more tasks available, so I saved and quit the session. When I then run train ner on the same dataset, it throws an error. I have spent time reading but cannot see what I am doing wrong. Any help appreciated.

The code and trace are below:

| => python3 -m prodigy dataset sentsmall "sentsmall dataset"

✔ Successfully added 'sentsmall' to database SQLite

___________________ | ~ @ Jacks-MacBook-Pro (jrs) 

| => python3 -m prodigy ner.teach sentsmall en_core_web_lg ./documents/sents_small.jsonl --label PERSON

Using 1 label(s): PERSON

✨ Starting the web server at http://localhost:8080 ...

Open the app in your browser and start annotating!

^C

✔ Saved 77 annotations to database SQLite

Dataset: sentsmall

Session ID: 2020-01-03_00-20-55

=> python3 -m prodigy train ner sentsmall en_core_web_lg --output ./sentsm --n-iter 20 --binary
✔ Loaded model 'en_core_web_lg'
Using 34 train / 33 eval (split 50%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 20
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 136, in train
    eval_data = [(doc.text, annot) for doc, annot in eval_data]
  File "/usr/local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 136, in <listcomp>
    eval_data = [(doc.text, annot) for doc, annot in eval_data]
ValueError: too many values to unpack (expected 2)

Thanks for the report and sorry about that – I just tracked down the bug and it was related to a problem with how evaluation data was interpreted for binary training. I've already fixed it and we'll include the fix in the next release.

In the meantime, you can work around this problem in 2 ways:

  • Go to line 136 in prodigy/recipes/train.py and move it into an if not binary condition. Like this:
if not binary:
    eval_data = [(doc.text, annot) for doc, annot in eval_data]

You can find the location of your Prodigy installation by running python -c "import prodigy; print(prodigy.__file__)".

  • Alternatively, you can use the previous ner.batch-train recipe. You can run prodigy ner.batch-train --help to see the documentation and arguments it needs. It's pretty similar to the new train recipe. The stats it reports are a bit less detailed and not as nice, though.

Thank you Ines,

I changed line 136 and it now gets through the training (I think), but then gives another error about the pipeline components. I used the same model (en_core_web_lg) for both ner.teach and train, so I'm not sure what the cause could be. Code:

| => python3 -m prodigy train ner sentsmall en_core_web_lg --output ./sentsm --n-iter 20 --binary
✔ Loaded model 'en_core_web_lg'
Using 34 train / 33 eval (split 50%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 20
ℹ Baseline accuracy: 0.032

=========================== ✨  Training the model ===========================

#    Loss       Skip    Right   Wrong   Accuracy
--   --------   -----   -----   -----   --------
 1       0.00       0       1       1      0.500                                
 2       1.90       0       1       1      0.500                                
 3       0.78       0       1       0      1.000                                
 4       0.00       0       1       0      1.000                                
 5       0.00       0       1       1      0.500                                
 6       0.00       0       1       1      0.500                                
 7       0.00       0       1       1      0.500                                
 8       0.00       0       1       0      1.000                                
 9       0.00       0       1       0      1.000                                
10       0.00       0       1       0      1.000                                
11       0.00       0       1       0      1.000                                
12       0.00       0       1       0      1.000                                
13       0.00       0       1       0      1.000                                
14       2.46       0       1       0      1.000                                
15       0.00       0       1       0      1.000                                
16       0.00       0       1       0      1.000                                
17       1.46       0       1       0      1.000                                
18       1.95       0       1       0      1.000                                
19       0.00       0       1       0      1.000                                
20       0.00       0       1       0      1.000                                

Correct     1    
Incorrect   0    
Baseline    0.032             
Accuracy    1.000

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 171, in train
    disabled.restore()
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 1095, in restore
    raise ValueError(Errors.E008.format(names=unexpected))
ValueError: [E008] Some current components would be lost when restoring previous pipeline state. If you added components after calling `nlp.disable_pipes()`, you should remove them explicitly with `nlp.remove_pipe()` before the pipeline is restored. Names of the new components: ['sentencizer']
___________________    | ~ @ Jacks-MacBook-Pro (jrs) 
| =>

Ah, this seems to be caused by the binary annotation model adding a sentencizer if none is present in the pipeline, which then results in a conflict when the previously disabled components are restored before the model is saved out. We'll also fix this for the next release.

In the meantime, you can work around it by moving the line annot_model = ... above the line disabled = ... in the train recipe.
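In case it's useful, here's a minimal standalone sketch of the underlying spaCy v2 behaviour (not Prodigy's actual recipe code): adding a component after nlp.disable_pipes(...) and then calling restore() raises exactly this E008, whereas adding it beforehand, or removing it again first, does not.

import spacy

# Tiny pipeline for illustration: a blank model with an "ner" component
nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("ner"))

disabled = nlp.disable_pipes("ner")           # snapshot of the pipeline is taken here
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # component added *after* the snapshot
disabled.restore()                            # ValueError [E008], names=['sentencizer']

# Either remove the new component before restoring:
#     nlp.remove_pipe("sentencizer")
# ... or make sure it's added before disable_pipes() is called, which is what
# reordering annot_model = ... above disabled = ... achieves.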

OK I'll try that. Thanks again Ines!

Update: I changed train.py as above.

I previously collected 20 datasets of silver annotations created with ner.teach, each containing binary annotations on a different label. I believe a few of them had more than one label, but most were one label per ner.teach session. In all cases, the en_core_web_lg model was used.

I then used the train recipe (with the --binary flag) to train a fresh en_core_web_lg model on all these silver datasets simultaneously, and saved the resulting model, testmodelx.

I also previously collected a set of 5 gold datasets that were fully annotated using ner.correct. The models being corrected were successively bootstrapped from en_core_web_lg via ner.batch-train.

I then used the train recipe (without the --binary flag) to further train testmodelx on all 5 of my gold datasets, and it fails:

| => python3 -m prodigy train ner goodallfull1,goodallfull2,goodallfull3,goodallfull4,goodallfull5 ./testmodelx --output ./testmodelx2 --eval-split 0.2 --n-iter 150 --batch-size -1 --dropout 0.2
✔ Loaded model './testmodelx'
Created and merged data for 50 total examples
Using 40 train / 10 eval (split 20%)
Component: ner | Batch size: compounding | Dropout: 0.2 | Iterations: 150
ℹ Baseline accuracy: 24.002

=========================== ✨  Training the model ===========================

#    Loss       Precision   Recall     F-Score 
--   --------   ---------   --------   --------
1:  35%|█████████████████▏                               | 14/40 [00:17<00:38,  1.48s/it]
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 156, in train
    nlp.update(docs, annots, drop=dropout, losses=losses)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 515, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 445, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 550, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "transition_system.pyx", line 95, in spacy.syntax.transition_system.TransitionSystem.get_oracle_sequence
  File "transition_system.pyx", line 156, in spacy.syntax.transition_system.TransitionSystem.set_costs
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace. You can also use the experimental `debug-data` command to validate your JSON-formatted training data. For details, run:
python -m spacy debug-data --help

I do not get this error if I run train on en_core_web_lg directly with the gold annotations, nor did I get it when I trained successive models on these datasets using ner.batch-train. Furthermore, if I train en_core_web_lg on the golds and then train the resulting model on the silvers, I do not get the error and everything runs. So I don't think there is a true tokenization mismatch. I should add, however, that the gold annotations were created with the --unsegmented flag. I would expect that labeling an entity across sentence boundaries is supported, though.

Any idea what could be the problem here?

Just released Prodigy v1.9.5, which should resolve the underlying issues in the train recipe when using --binary (evaluation data and restoring disabled components).

That's a good observation and might explain what's going on. I need to double-check this, but I think the entity recognizer considers entities that cross sentence boundaries illegal, just like whitespace entities. Having these constraints makes sense because it limits the number of possible analyses the model has to consider, and in any real-world scenario (assuming the sentence boundaries are accurate), a named entity should never span a sentence boundary.

I guess you could easily test this by processing all examples with the same spaCy model and then checking whether there are entity spans that don't fall within a single sentence. (So basically, check each start/end pair in the "spans" against the sent.start_char / sent.end_char offsets.)
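Something like this rough sketch should do it, assuming you've exported the annotations to JSONL first (e.g. prodigy db-out goodallfull1 > goodallfull1.jsonl; the file name below is just a placeholder):

import spacy
import srsly  # installed alongside spaCy/Prodigy, reads JSONL

nlp = spacy.load("en_core_web_lg")  # same model used during annotation

examples = list(srsly.read_jsonl("goodallfull1.jsonl"))  # placeholder file name

for eg in examples:
    doc = nlp(eg["text"])
    sentence_bounds = [(sent.start_char, sent.end_char) for sent in doc.sents]
    for span in eg.get("spans", []):
        # the span is fine if it sits entirely inside one sentence
        inside_one_sentence = any(
            start <= span["start"] and span["end"] <= end
            for start, end in sentence_bounds
        )
        if not inside_one_sentence:
            print("Crosses a sentence boundary:", repr(eg["text"][span["start"]:span["end"]]))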

One level of variance you're introducing during training comes from the held-back evaluation examples. If you're using the same datasets in the same order and have the random seed set (which Prodigy should do by default), the 20% of examples that are held back should always be the same. However, if you change the datasets or their order, you may end up with different examples. A problematic example that was previously in the evaluation set could now end up in the training set and cause an error during updating. This is why it often makes sense to use a dedicated evaluation set at some point, so you can compare the results and behaviour more reliably.
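If you do want to split off a fixed evaluation set, one way is to do it once with Prodigy's database API and keep reusing that dataset. A rough sketch (the dataset names here are just placeholders):

import random
from prodigy.components.db import connect

db = connect()                             # uses your prodigy.json database settings
examples = db.get_dataset("goodallfull1")  # one of your gold datasets

random.seed(0)                             # deterministic, reproducible split
random.shuffle(examples)
split = int(len(examples) * 0.2)
eval_examples, train_examples = examples[:split], examples[split:]

# Placeholder dataset names, pick whatever fits your project
db.add_dataset("goodall_eval")
db.add_dataset("goodall_train")
db.add_examples(eval_examples, datasets=["goodall_eval"])
db.add_examples(train_examples, datasets=["goodall_train"])

You should then be able to point the train recipe at the dedicated set via --eval-id instead of relying on --eval-split, so the evaluation examples stay fixed across runs.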