Yes, the textcat.batch-train recipe expects binary annotations (i.e. accept/reject), whereas your recipe creates multiple-choice annotations where each example has an "accept" key with the selected labels. If you export your data using prodigy db-out, you'll see the format.
So you pretty much have two options. One is to just use spaCy directly to train a text classifier: you can call nlp.update with a text and a dictionary of annotations, for example {'cats': {'HAPPY': True, 'SAD': False}}. You can generate this format from your Prodigy annotations and then write your own little training script.
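For instance, here's a minimal sketch of such a training loop, assuming spaCy v2.x, the labels HAPPY and SAD, and some made-up example texts:

import random
import spacy

nlp = spacy.blank('en')  # start with a blank English pipeline
textcat = nlp.create_pipe('textcat')
for label in ('HAPPY', 'SAD'):
    textcat.add_label(label)
nlp.add_pipe(textcat)

# (text, annotations) pairs in the format described above
train_data = [
    ('I love this!', {'cats': {'HAPPY': True, 'SAD': False}}),
    ('This is terrible.', {'cats': {'HAPPY': False, 'SAD': True}}),
]

optimizer = nlp.begin_training()
for i in range(10):  # number of training iterations
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)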
The other option would be to convert your annotations to the binary format, import them to a new dataset and then use textcat.batch-train. Just keep in mind that you also need negative examples – so you probably want to create one example per text per label, mark the correct label(s) as "answer": "accept" and all others as "answer": "reject". Otherwise, your model may learn that "every example is accepted!", which gives you a great accuracy of 100% – but a really useless classifier.
Here's an example of a conversion script (updated version of the one in this thread):
import copy

from prodigy.components.db import connect
from prodigy.util import set_hashes

db = connect()
examples = db.get_dataset('your_dataset_name')
labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are

converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the multiple-choice example wasn't accepted
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # If this label is in the selected options, mark this example
        # (text plus label) as accepted. Otherwise, mark it as rejected,
        # i.e. wrong
        new_eg['answer'] = 'accept' if label in eg['accept'] else 'reject'
        # not 100% sure if necessary, but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)
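Once you have the converted examples, you can also add them to a new dataset straight from Python using the same Database object. For example (the dataset name here is just a placeholder):

db.add_dataset('your_converted_dataset')  # create the new dataset
db.add_examples(converted_examples, datasets=['your_converted_dataset'])

You can then run textcat.batch-train on that dataset as usual.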