E895 when training with textcat.manual --exclusive

Ok, so I'm running into some trouble doing multiple passes on a dataset.

I have a dataset, recipes.jsonl, that I'm trying to annotate.

My training process is as follows:

textcat.manual recipes ./recipes.jsonl --label BREAKFAST --exclusive
textcat.manual recipes ./recipes.jsonl --label DESSERT --exclusive
textcat.manual recipes ./recipes.jsonl --label POULTRY --exclusive

Then, when I ran the training, I got this output:

⇒  prodigy train --textcat recipes recipe.model
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2021-08-20 12:57:31,211] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
  - [textcat] Training: 1120 | Evaluation: 280 (20% split)
Training: 727 | Evaluation: 264
Labels: textcat (3)
[2021-08-20 12:57:31,429] [INFO] Pipeline: ['textcat']
[2021-08-20 12:57:31,433] [INFO] Created vocabulary
[2021-08-20 12:57:31,434] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 325, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/recipes/train.py", line 276, in train
    return _train(
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/prodigy/recipes/train.py", line 188, in _train
    nlp = spacy_init_nlp(config, use_gpu=gpu_id)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/training/initialize.py", line 82, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/language.py", line 1273, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 331, in initialize
    self._validate_categories(get_examples())
  File "/Users/joec/.local/share/virtualenvs/ml-recipe-dump-SgMMmNgg/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 381, in _validate_categories
    raise ValueError(Errors.E895.format(value=ex.reference.cats))
ValueError: [E895] The 'textcat' component received gold-standard annotations with multiple labels per document. In spaCy 3 you should use the 'textcat_multilabel' component for this instead. Example of an offending annotation: {'BREAKFAST': 1.0, 'DESSERT': 1.0, 'POULTRY': 0.0}

I'm guessing I labeled something as both a dessert and breakfast.

I was able to back out of the session so I could train my model.

What's not clear to me is how that is even allowed. When I run textcat.manual and annotate items in one category, then run it again on another category, should I be presented with the same material again?

What am I missing here?

Hi! If you're running textcat.manual with only one label, you're asked a text + label question and when you run it again with a different single label, the example will be interpreted as a different question about the same text, so you'll see it again. Or, in more abstract terms, if you're annotating with only one label, the tasks are excluded based on the task hash, not the input hash. This allows making multiple passes over the same data with different labels, and combining multiple annotations on the same text at the end.
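
Just to illustrate the hashing behavior, here's a minimal sketch (with a made-up text and labels, not part of any recipe) using Prodigy's set_hashes helper, which assigns an _input_hash based on the text and a _task_hash based on the text plus the question being asked (e.g. the label):

from prodigy import set_hashes

# Hypothetical example: the same text, asked about two different labels
eg1 = set_hashes({"text": "Fluffy pancakes with syrup", "label": "BREAKFAST"})
eg2 = set_hashes({"text": "Fluffy pancakes with syrup", "label": "DESSERT"})

assert eg1["_input_hash"] == eg2["_input_hash"]  # same underlying text
assert eg1["_task_hash"] != eg2["_task_hash"]    # different questions

So excluding by task hash means a second pass with a different single label still asks about the same texts, because each pass is a different question.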

The --exclusive flag is currently not factored in if you only have one label, because it theoretically doesn't matter for the interface. However, I do see the argument for having it decide what to exclude and send out again; it'd just be slightly more complex and potentially a bit unintuitive. In that case, you'd only want to see examples again that you previously rejected (and never accepted), because those are the ones you don't know the answer for yet. This would require loading all previous annotations, finding the input hashes of all examples that were accepted, and then skipping those in the stream. So you'd only be annotating "what's left". I'd have to test this, though, to make sure there are no unintended side effects that I'm missing.

Btw, just a more general thought: if you only have one or two examples that were annotated with two labels, that could have just been a mistake/misclick. But if you have a lot of them, this could indicate that there's some overlap between some of the categories, and it might be a better fit for a multi-label classification task instead of an exclusive prediction problem? Of course, I don't know what your texts look like in detail, but something about pancakes could be both BREAKFAST and DESSERT, and the distinction can easily become quite subjective. If there are no clear indicators in the text, this can also be something that your model may struggle to learn.

In that case, you'd only want to see examples again that you previously rejected (and never accepted), because those are the ones you don't know the answer for yet.

Yeah, that was how I envisioned the --exclusive flag to work. That's just how my brain works, I guess :slight_smile:

My taxonomy has 13 terms so doing it all at once seems like it would be a slower process. Do you have another suggestion on how to go about tagging?

Of course, I don't know what your texts look like in detail, but something about pancakes could be both BREAKFAST and DESSERT, and the distinction can easily become quite subjective. If there are no clear indicators in the text, this can also be something that your model may struggle to learn.

That's a good point. Right now we want to be opinionated in our categorization but I could see a case for supporting multiple categories per recipe. If I wanted to switch to multilabel could I do it against the same database or would I need to start over?

I haven't tested this in detail yet, but if you wanted to implement something like this, adding the following to your recipe should work:

from prodigy.components.db import connect

def filter_accepted_inputs(examples, stream):
    # Collect the input hashes of all examples that were accepted before
    accepted = set()
    for eg in examples:
        if eg["answer"] == "accept":
            accepted.add(eg["_input_hash"])
    # Only send out examples whose text hasn't been accepted yet
    for eg in stream:
        if eg["_input_hash"] not in accepted:
            yield eg

# At the end of your recipe
if not has_options and exclusive:
    db = connect()
    if dataset in db:
        stream = filter_accepted_inputs(db.get_dataset(dataset), stream)

But yeah, this is slightly different from how the other exclude mechanisms work because it also looks at the specific answers of examples in the dataset. If it works and doesn't introduce unintended side-effects, I'd be happy to ship this as a built-in feature because it's definitely more logical.

Yes, in theory, you can use the same dataset to train an exclusive or multilabel text classifier (e.g. via --textcat vs. --textcat-multilabel in train or data-to-spacy).
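
For example, with the dataset and output path from your command above (the exact CLI details may differ slightly between versions), the two framings would look something like this:

prodigy train ./recipe.model --textcat recipes
prodigy train ./recipe.model --textcat-multilabel recipes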

Of course, if you want to frame your task as a multilabel classification problem instead, it might make sense to go over your data again and annotate some more examples with multiple categories, if there are overlaps. Queueing up the data for this should be pretty straightforward to do programmatically, though: you can collect the annotated labels for each input hash, add them to a multiple choice example as the "accept" key and then review the examples with "choice_style": "multiple". This will pre-select the already annotated label(s) and lets you tick more boxes if they apply.
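
Here's a rough sketch of that, assuming the recipes dataset from above, that your accepted single-label annotations store the category under a top-level "label" key, and a made-up ALL_LABELS list standing in for your full taxonomy:

from collections import defaultdict
from prodigy.components.db import connect

ALL_LABELS = ["BREAKFAST", "DESSERT", "POULTRY"]  # your full taxonomy here

db = connect()
texts = {}                           # input hash -> text
labels_by_input = defaultdict(set)   # input hash -> accepted labels
for eg in db.get_dataset("recipes"):
    if eg.get("answer") == "accept" and "label" in eg:
        texts[eg["_input_hash"]] = eg["text"]
        labels_by_input[eg["_input_hash"]].add(eg["label"])

# Build one multiple-choice question per text, pre-selecting existing labels
stream = []
for input_hash, text in texts.items():
    stream.append({
        "text": text,
        "options": [{"id": label, "text": label} for label in ALL_LABELS],
        "accept": sorted(labels_by_input[input_hash]),
    })

You could then save that stream out to JSONL and annotate it with a choice interface, e.g. a custom recipe that uses "view_id": "choice" and sets "choice_style": "multiple" in its config.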


To test it I cloned the textcat.manual recipe and added the code you suggested.

Running some tests on this now. It works as I expected.

Thanks for jumping on this quickly. The process is going much smoother now.

Update: Now shipped with v1.11.3 :tada:
