Ignored sentences for text classification

Yes, the textcat.batch-train recipe expects binary annotations (i.e. accept/reject), whereas your recipe creates multiple choice annotations where each example has an "accept" key with the selected labels. If you export your data using prodigy db-out, you’ll see the format.
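
To make the difference concrete, here's a rough sketch of the two shapes. The text and labels are made up for illustration, and is_multiple_choice is a hypothetical helper, not part of Prodigy. Your exported records will carry extra keys (hashes, the full options list and so on).

```python
# Illustrative records only: the text and labels are invented for this sketch.

# What a multiple choice recipe saves: "accept" holds the selected option IDs
choice_example = {
    "text": "What a great day",
    "options": [{"id": "HAPPY", "text": "HAPPY"}, {"id": "SAD", "text": "SAD"}],
    "accept": ["HAPPY"],
    "answer": "accept",
}

# What textcat.batch-train expects: one label per example, accepted or rejected
binary_example = {
    "text": "What a great day",
    "label": "HAPPY",
    "answer": "accept",
}


def is_multiple_choice(eg):
    # Hypothetical helper: multiple choice examples carry a list under "accept"
    return isinstance(eg.get("accept"), list)
```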

So you pretty much have two options – one is to just use spaCy directly to train a text classifier. You can call nlp.update with a text and a dictionary of annotations. So for example: {'cats': {'HAPPY': True, 'SAD': False}}. You can generate this format from your Prodigy annotations and then write your own little training script.
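
Generating that format could look something like this. It's a minimal sketch in plain Python: to_spacy_textcat is a name made up here, and the sample examples stand in for what you'd load from your dataset.

```python
def to_spacy_textcat(examples, labels):
    """Turn accepted multiple choice examples into (text, annotations) pairs."""
    train_data = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        selected = set(eg.get("accept", []))
        # Every label gets an explicit True/False value
        cats = {label: label in selected for label in labels}
        train_data.append((eg["text"], {"cats": cats}))
    return train_data


# Stand-in data; in practice this would come from your Prodigy dataset
examples = [
    {"text": "What a great day", "accept": ["HAPPY"], "answer": "accept"},
    {"text": "Skipped this one", "accept": [], "answer": "ignore"},
]
train_data = to_spacy_textcat(examples, ("HAPPY", "SAD"))
```

Each (text, annotations) pair can then go into nlp.update in your training loop. How exactly you call it depends on your spaCy version, so check the docs for the one you have installed.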

The other option would be to convert it to the binary format, import it to a new dataset and then use textcat.batch-train. Just keep in mind that you also need negative examples – so you probably want to create one example per text per label, mark the correct label(s) as "answer": "accept" and all others as "answer": "reject". Otherwise, your model may learn that “every example is accepted!”, which gives you a great accuracy of 100% – but a really useless classifier.

Here’s an example for a conversion script (updated version of the one in this thread):

from prodigy.components.db import connect
from prodigy.util import set_hashes
import copy

db = connect()
examples = db.get_dataset('your_dataset_name')
labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are

converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the multiple choice example wasn't accepted
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # If this label is among the selected options, mark this example
        # (text plus label) as accepted. Otherwise, mark it as rejected, i.e. wrong
        new_eg['answer'] = 'accept' if label in eg.get('accept', []) else 'reject'
        # not 100% sure if necessary but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)
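
For the "export this later" part, one way to do it is to write the converted examples to a JSONL file using only the standard library. This is just a sketch: write_jsonl and the file name are made up here, and the short list below is a stand-in so the snippet runs on its own. In the script above you'd pass in the real converted_examples instead.

```python
import json


def write_jsonl(path, examples):
    # One JSON object per line, which is the layout of a JSONL file
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


# Stand-in for the list built by the conversion script
converted_examples = [
    {"text": "What a great day", "label": "HAPPY", "answer": "accept"},
    {"text": "What a great day", "label": "SAD", "answer": "reject"},
]
write_jsonl("converted_examples.jsonl", converted_examples)
```

You should then be able to import the file into a new dataset with prodigy db-in (e.g. prodigy db-in your_new_dataset converted_examples.jsonl, where the dataset name is a placeholder) and run textcat.batch-train on that dataset.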