Yes, the textcat.batch-train recipe expects binary annotations (i.e. accept/reject), whereas your recipe creates multiple-choice annotations where each example has an "accept" key with the selected labels. If you export your data using prodigy db-out, you'll see the format.
So you pretty much have two options. One is to just use spaCy directly to train a text classifier: you can call nlp.update with a text and a dictionary of annotations, for example {'cats': {'HAPPY': True, 'SAD': False}}. You can generate this format from your Prodigy annotations and then write your own little training script.
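For instance, here's a minimal sketch of such a training loop, assuming spaCy v2.x, the labels HAPPY and SAD, and some made-up example texts:

import random
import spacy

nlp = spacy.blank('en')  # start with a blank English pipeline
textcat = nlp.create_pipe('textcat')
for label in ('HAPPY', 'SAD'):
    textcat.add_label(label)
nlp.add_pipe(textcat)

# (text, annotations) pairs in the format described above
train_data = [
    ('I love this!', {'cats': {'HAPPY': True, 'SAD': False}}),
    ('This is terrible.', {'cats': {'HAPPY': False, 'SAD': True}}),
]

optimizer = nlp.begin_training()
for i in range(10):  # number of training iterations
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)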
The other option would be to convert your annotations to the binary format, import them to a new dataset and then use textcat.batch-train. Just keep in mind that you also need negative examples – so you probably want to create one example per text per label, mark the correct label(s) as "answer": "accept" and all others as "answer": "reject". Otherwise, your model may learn that "every example is accepted!", which gives you a great accuracy of 100% – but a really useless classifier.
Here's an example of a conversion script (updated version of the one in this thread):
import copy

from prodigy.components.db import connect
from prodigy.util import set_hashes

db = connect()
examples = db.get_dataset('your_dataset_name')
labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are

converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the multiple-choice example wasn't accepted
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # If this label is in the selected options, mark this example
        # (text plus label) as accepted. Otherwise, mark it as rejected,
        # i.e. wrong
        new_eg['answer'] = 'accept' if label in eg['accept'] else 'reject'
        # not 100% sure if necessary, but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)
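Once you have the converted examples, you can also add them to a new dataset straight from Python using the same Database object. For example (the dataset name here is just a placeholder):

db.add_dataset('your_converted_dataset')  # create the new dataset
db.add_examples(converted_examples, datasets=['your_converted_dataset'])

You can then run textcat.batch-train on that dataset as usual.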