OK, so I'm running into some trouble doing multiple passes over a dataset.
I have a dataset, recipes.jsonl, that I'm trying to annotate.
My workflow is as follows:
prodigy textcat.manual recipes ./recipes.jsonl --label BREAKFAST --exclusive
prodigy textcat.manual recipes ./recipes.jsonl --label DESSERT --exclusive
prodigy textcat.manual recipes ./recipes.jsonl --label POULTRY --exclusive
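In case it matters, this is roughly how I've been sanity-checking what each pass saves, using Prodigy's database API. It's just a sketch: I'm assuming each binary textcat.manual example stores its label under "label" and the decision under "answer".

# Sketch: peek at a few saved examples from the "recipes" dataset.
# The "label"/"answer" field names are my assumption about what a
# single-label textcat.manual pass writes, not something I've verified.
from prodigy.components.db import connect

db = connect()                          # uses the same DB settings as the CLI
examples = db.get_dataset("recipes")    # all examples from the three passes

for eg in examples[:5]:
    print(eg.get("label"), eg.get("answer"), eg.get("text", "")[:60])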
Then, when I run the training, I get this output:
⇒ prodigy train --textcat recipes recipe.model
ℹ Using CPU
========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config
=========================== Initializing pipeline ===========================
[2021-08-20 12:57:31,211] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
- [textcat] Training: 1120 | Evaluation: 280 (20% split)
Training: 727 | Evaluation: 264
Labels: textcat (3)
[2021-08-20 12:57:31,429] [INFO] Pipeline: ['textcat']
[2021-08-20 12:57:31,433] [INFO] Created vocabulary
[2021-08-20 12:57:31,434] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 325, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/recipes/train.py", line 276, in train
return _train(
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/recipes/train.py", line 188, in _train
nlp = spacy_init_nlp(config, use_gpu=gpu_id)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/training/initialize.py", line 82, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/language.py", line 1273, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 331, in initialize
self._validate_categories(get_examples())
File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 381, in _validate_categories
raise ValueError(Errors.E895.format(value=ex.reference.cats))
ValueError: [E895] The 'textcat' component received gold-standard annotations with multiple labels per document. In spaCy 3 you should use the 'textcat_multilabel' component for this instead. Example of an offending annotation: {'BREAKFAST': 1.0, 'DESSERT': 1.0, 'POULTRY': 0.0}
I'm guessing I labeled something as both DESSERT and BREAKFAST.
I was able to back out of the session so I could train my model.
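For what it's worth, this is roughly the kind of thing I used to confirm that guess and back the offending annotations out into a separate dataset. The recipes_clean name is just something I made up, and again I'm assuming the "label"/"answer" fields.

# Sketch: find texts accepted under more than one label, then copy only the
# non-conflicting examples into a fresh dataset to train from.
from collections import defaultdict
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("recipes")

accepted = defaultdict(set)
for eg in examples:
    if eg.get("answer") == "accept" and "label" in eg:
        accepted[eg["text"]].add(eg["label"])

conflicting = {text for text, labels in accepted.items() if len(labels) > 1}
for text in conflicting:
    print("conflict:", sorted(accepted[text]), text[:80])

clean = [eg for eg in examples if eg.get("text") not in conflicting]
db.add_dataset("recipes_clean")                      # hypothetical dataset name
db.add_examples(clean, datasets=["recipes_clean"])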
What's not clear to me is how that is even allowed. When I run textcat.manual and annotate items for one category, then run it again for another category, should I be presented with the same material again?
What am I missing here?