"prodigy train textcat ... " doesn't discard reject/ignore examples

I may be wrong, but I don't see where in train.py the examples are being filtered down to only "answer": "accept" examples? If I filter out the "ignore" examples manually (I don't have any "reject" examples in my dataset), the output reports a lower total number of examples used, so I'm fairly sure they're not being discarded.
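
A minimal sketch of that manual filtering, assuming Prodigy's Database helpers (connect, get_dataset, add_examples) and placeholder dataset names:

```python
# Rough sketch: keep only accepted annotations in a new dataset.
# "my_textcat_data" and "my_textcat_data_accepted" are placeholder names.
from prodigy.components.db import connect

db = connect()  # uses the database settings from prodigy.json
examples = db.get_dataset("my_textcat_data")

# Keep only explicitly accepted annotations
accepted = [eg for eg in examples if eg.get("answer") == "accept"]

# Save them to a new dataset and train from that instead
db.add_dataset("my_textcat_data_accepted")
db.add_examples(accepted, datasets=["my_textcat_data_accepted"])
```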

For textcat annotations, this happens in the convert_options_to_cats helper. It checks whether the annotations are multiple-choice answers (with "options" etc.) or binary yes/no answers about a single "label" and creates a more consistent representation of the labels (similar to spaCy's "cats" format). It also takes into account whether the categories are mutually exclusive. "answer": "reject" is treated differently, depending on whether the annotations are on one label at a time (reject = we know the label doesn't apply) or on multiple choice options (reject = we know which labels don't apply, but we don't know which ones do).
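
To illustrate the kind of representation this produces (the tasks and labels below are made up, and this is a sketch of the idea rather than the helper's exact output):

```python
# Illustrative only, not the actual convert_options_to_cats output.

# Multiple-choice task with mutually exclusive options: the selected option
# becomes a positive cat, the unselected ones become negative cats.
multiple_choice = {
    "text": "Great battery life",
    "options": [{"id": "POSITIVE"}, {"id": "NEGATIVE"}],
    "accept": ["POSITIVE"],
    "answer": "accept",
}
# -> cats: {"POSITIVE": 1.0, "NEGATIVE": 0.0}

# Binary yes/no task about a single label:
# accept = the label applies, reject = we know the label does not apply.
binary_accept = {"text": "Great battery life", "label": "POSITIVE", "answer": "accept"}
# -> cats: {"POSITIVE": 1.0}
binary_reject = {"text": "Great battery life", "label": "POSITIVE", "answer": "reject"}
# -> cats: {"POSITIVE": 0.0}
```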

I just had a look and I think what's currently happening is that ignored answers receive "cats": {}, which should have the same effect as discarding them, but it's inconsistent with how the other components handle ignored examples and it results in inconsistent example counts being reported. I've already fixed this internally and we'll include the fix in the next release :+1:
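
If you want to see where the difference in the reported totals comes from, counting the answer types in your dataset is a quick check (the dataset name is a placeholder):

```python
# Count answer types in a dataset to compare against the totals reported by train.
from collections import Counter
from prodigy.components.db import connect

db = connect()
answer_counts = Counter(eg.get("answer") for eg in db.get_dataset("my_textcat_data"))
print(answer_counts)  # e.g. Counter({"accept": 950, "ignore": 50})
```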

I see, thanks for the quick response. Do you know whether examples with "cats":{} are also ignored in nlp.evaluate?

As far as I know, that should be the case, yes: more specifically, the Scorer should only score gold parses with one or more cats, so examples with cats = {} wouldn't be included (relevant code here).
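
If you want to double-check this on your own data, one way is to compare the scores with and without the empty-cats examples. This sketch assumes the spaCy v2 evaluate API that accepts (text, annotations) pairs; the model name, texts and labels are placeholders:

```python
# Sketch: verify that examples with empty cats don't change the textcat scores.
import spacy

nlp = spacy.load("my_textcat_model")  # placeholder model name

examples = [
    ("Great battery life", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("Terrible screen", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("Some ignored text", {"cats": {}}),  # should be skipped by the Scorer
]

scorer_all = nlp.evaluate(examples)
scorer_filtered = nlp.evaluate([eg for eg in examples if eg[1]["cats"]])

# If empty-cats examples are ignored, both score dicts should match
print(scorer_all.scores)
print(scorer_filtered.scores)
```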

Just released v1.9.7, which makes the handling of ignored examples in train consistent for textcat annotations :slightly_smiling_face: