I may be wrong, but I don't see where in train.py the examples are being filtered down to only "answer": "accept" examples? If I filter out the "ignore" examples manually (I don't have any "reject" in my dataset), the output reports a lower total number of examples used, so I'm fairly sure of this.
For textcat annotations, this happens in the convert_options_to_cats helper. It checks whether the annotations are multiple-choice answers (with "options" etc.) or binary yes/no answers about a single "label", and creates a more consistent representation of the labels (similar to spaCy's "cats" format). It also takes into account whether the categories are mutually exclusive. "answer": "reject" is treated differently depending on whether the annotations are on one label at a time (reject = we know the label doesn't apply) or on multiple choice (reject = we know which labels don't apply, but we don't know what does apply).
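To make the idea concrete, here's a minimal sketch of that kind of conversion. The function name task_to_cats, its signature, and the task fields used are simplified assumptions for illustration, not Prodigy's actual convert_options_to_cats implementation:

```python
# A rough sketch of converting annotation tasks to a spaCy-style "cats" dict.
# Assumes tasks are dicts with either "options"/"accept" (multiple choice) or
# a single "label" (binary yes/no), plus an "answer" key.

def task_to_cats(task, all_labels, exclusive=False):
    answer = task.get("answer", "accept")
    if answer == "ignore":
        # Ignored answers contribute no category information.
        return {}
    if "options" in task:
        selected = set(task.get("accept", []))
        if answer == "accept":
            # Accepted multiple choice: selected labels apply; if the categories
            # are mutually exclusive, the unselected ones don't.
            cats = {label: 1.0 for label in selected}
            if exclusive:
                cats.update({label: 0.0 for label in all_labels - selected})
            return cats
        # Rejected multiple choice: we only know the selected labels do NOT apply.
        return {label: 0.0 for label in selected}
    # Binary yes/no answer about a single "label".
    label = task["label"]
    return {label: 1.0 if answer == "accept" else 0.0}


example = {"text": "...", "options": [{"id": "SPORTS"}, {"id": "POLITICS"}],
           "accept": ["SPORTS"], "answer": "accept"}
print(task_to_cats(example, {"SPORTS", "POLITICS"}, exclusive=True))
# {'SPORTS': 1.0, 'POLITICS': 0.0}
```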
I just had a look and I think what's currently happening is that ignored answers receive "cats": {}, which should have the same effect – but it's inconsistent with the other components and does result in inconsistent numbers being reported. I've already fixed this internally and we'll include the fix in the next release.
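To illustrate the inconsistency (this is just a toy example, not the actual train.py code): dropping non-accepted answers up front changes the reported total, whereas mapping ignored answers to "cats": {} keeps them in the count even though they contribute nothing to the update.

```python
annotations = [
    {"text": "a", "answer": "accept", "accept": ["SPORTS"]},
    {"text": "b", "answer": "ignore"},
    {"text": "c", "answer": "accept", "accept": ["POLITICS"]},
]

# Consistent with the other components: filter before counting.
filtered = [eg for eg in annotations if eg["answer"] == "accept"]
print(f"Using {len(filtered)} examples")  # Using 2 examples

# Current textcat behaviour: keep the example but give it empty cats.
kept = [
    (eg["text"], {} if eg["answer"] == "ignore"
     else {label: 1.0 for label in eg["accept"]})
    for eg in annotations
]
print(f"Using {len(kept)} examples")      # Using 3 examples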
I see, thanks for the quick response. Do you know whether examples with "cats": {} are also ignored in nlp.evaluate?
As far as I know, that should be the case, yes: more specifically, the Scorer should only score gold parses with one or more cats, so examples with cats = {} wouldn't be included (relevant code here).
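The guard being described is roughly of this shape – a simplified sketch, not spaCy's actual Scorer code – where examples whose gold cats dict is empty are skipped and so don't affect the scores or the counts:

```python
def score_textcat(examples):
    """examples: iterable of (predicted_cats, gold_cats) dict pairs."""
    scored = 0
    correct = 0
    for predicted_cats, gold_cats in examples:
        if not gold_cats:  # cats == {} -> no gold info, skip entirely
            continue
        scored += 1
        best_pred = max(predicted_cats, key=predicted_cats.get)
        best_gold = max(gold_cats, key=gold_cats.get)
        correct += int(best_pred == best_gold)
    return {"examples_scored": scored,
            "accuracy": correct / scored if scored else 0.0}


examples = [
    ({"SPORTS": 0.9, "POLITICS": 0.1}, {"SPORTS": 1.0, "POLITICS": 0.0}),
    ({"SPORTS": 0.4, "POLITICS": 0.6}, {}),  # ignored -> empty cats, skipped
]
print(score_textcat(examples))  # {'examples_scored': 1, 'accuracy': 1.0}
```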
Just released v1.9.7, which makes the handling of ignored examples in train consistent for textcat annotations.