textcat.batch-train reject examples


I have a question about creating reject examples for the textcat.batch-train recipe.

I have examples with positive (accept) labels, but I am creating the negative (reject) examples artificially, based on the positive ones. These are the positive examples I have, by category:

Health 11065
PhysicsSci 3833
Technology 3449
Environment 3139
Energy 3000
Biology 2324
Transport 1776
Agriculture 275
Space 33
Biotechnology 13

To create reject examples for Energy, for instance, I could take the texts from all the other categories and use them as rejects. The problem is that some texts could legitimately carry both labels, e.g. Energy and Environment, so I would be confusing the model by telling it that no Environment text is an Energy text.
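To make the setup concrete, here is a minimal sketch of the one-vs-rest approach described above, in plain Python. The record layout (`text`, `label`, `answer`) is an assumption about what a binary text classification example looks like, and the sample texts and the helper name `make_binary_examples` are made up for illustration.

```python
def make_binary_examples(examples, target):
    """One-vs-rest: texts labelled `target` become accepts, all others rejects."""
    out = []
    for text, label in examples:
        answer = "accept" if label == target else "reject"
        out.append({"text": text, "label": target, "answer": answer})
    return out


# Hypothetical single-label data: (text, category) pairs.
examples = [
    ("Solar panel output rose last year.", "Energy"),
    ("Wind farms and their effect on bird habitats.", "Environment"),
    ("New vaccine trial results.", "Health"),
]

binary = make_binary_examples(examples, "Energy")
for eg in binary:
    print(eg["answer"], "-", eg["text"])
```

The second text is arguably about Energy as well, yet it ends up as a reject for Energy simply because its only recorded label is Environment. That is exactly the noise I am worried about.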

What could be the best strategy for creating reject examples?


Hi! If you don't know whether a category applies to a text, it's difficult to create labelled examples for it automatically without labelling at least some of the data by hand. If your categories are not actually mutually exclusive but you only have a single label per text, it might be easiest to select a subset of the annotations and re-label them with the remaining categories (e.g. using textcat.manual).
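One way to pick that subset is to draw a capped random sample from each category, so that the small categories are not drowned out by the large ones. The following is only a sketch under that assumption: `sample_for_relabelling` is a hypothetical helper, the cap of 3 is arbitrary, and keeping the original label under `meta` is just a convenience for the annotator, not something the recipe requires.

```python
import random
from collections import defaultdict


def sample_for_relabelling(examples, per_label=50, seed=0):
    """Draw up to `per_label` random texts from each category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    sample = []
    for label, texts in sorted(by_label.items()):
        k = min(per_label, len(texts))  # small categories are taken whole
        for text in rng.sample(texts, k):
            sample.append({"text": text, "meta": {"original_label": label}})
    return sample


# Toy data standing in for the real corpus.
examples = [(f"energy text {i}", "Energy") for i in range(5)]
examples += [(f"space text {i}", "Space") for i in range(2)]

subset = sample_for_relabelling(examples, per_label=3)
```

The resulting records can be written out one JSON object per line and annotated again with all categories available, which gives you at least some texts where the multi-label overlap is known.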

The latest version of Prodigy also supports training directly from the choice format (i.e. with a key "accept": [] mapped to a list of labels). So you don't necessarily need to create separate rejected examples anymore. You just want to make sure that you also provide examples where multiple labels apply to a text, so the model can learn that.
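As a rough illustration, existing labels could be converted into choice-style records like this. Only the "accept" key is described above; the "options" and "answer" fields, the shortened `LABELS` list and the helper `to_choice_example` are assumptions made for this sketch, so check them against the data your own choice annotations produce.

```python
import json

# Shortened label set for the example; use your full category list in practice.
LABELS = ["Health", "Energy", "Environment"]


def to_choice_example(text, labels):
    """Build one choice-style record with all selected labels under "accept"."""
    return {
        "text": text,
        "options": [{"id": label, "text": label} for label in LABELS],
        "accept": list(labels),
        "answer": "accept",
    }


records = [
    # A text where two labels apply at once.
    to_choice_example(
        "Wind farms and their effect on bird habitats.",
        ["Energy", "Environment"],
    ),
    # A text with a single label.
    to_choice_example("New vaccine trial results.", ["Health"]),
]

# One JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```

Labels that are offered in the options but missing from "accept" act as the implicit negatives, which is why no separate reject examples are needed.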