Converting choice annotations to textcat annotations



Thank you for all your help so far.

I came across another problem that I’m not sure how to resolve. I’m using the match recipe with the choice view_id to classify paragraphs. I would like to use the dataset obtained this way with the textcat.batch-train recipe, so I can train a model to recognize the class of a paragraph.

I know that the dataset I currently have is not in the right format for textcat.batch-train. I tried to customize the recipe, but I could not make it work.

Could you please advise me on what the best practice would be here?

(Ines Montani) #2

Sure! This shouldn’t actually be too difficult :slightly_smiling_face: The main difference of a “choice” dataset is that it has an "accept" property containing the IDs of the selected labels. The textcat.batch-train recipe on the other hand expects one example for each label.
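To make the difference concrete, here is a small illustration of the two shapes. The text and option values are made up, and the exact set of fields in a real dataset may differ; the sketch only assumes the "accept", "answer" and "label" properties mentioned in this thread, plus an "options" list of the kind the choice interface displays.

```python
# Hypothetical choice task; the values are invented for illustration.
choice_task = {
    "text": "The service was quick and friendly.",
    "options": [
        {"id": "positive", "text": "positive"},
        {"id": "negative", "text": "negative"},
        {"id": "neutral", "text": "neutral"},
    ],
    "accept": ["positive"],  # IDs of the selected options
    "answer": "accept",
}

# Shape that a per-label text classification dataset would use instead:
# one task per selected label, each carrying a single "label" value.
textcat_tasks = [
    {"text": choice_task["text"], "label": label, "answer": "accept"}
    for label in choice_task["accept"]
]

print(textcat_tasks)
```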

So you could write a little script that loads the data, iterates over the examples and copies them once for each label. I haven’t tested this yet, but something along those lines should work:

import copy

from prodigy.components.db import connect
from prodigy.util import set_hashes

db = connect()  # connect using the settings from your prodigy.json
examples = db.get_dataset('your_dataset_name')

converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the example wasn't accepted
    labels = eg['accept']  # the selected label(s)
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # not 100% sure if necessary but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)  # collect the converted example

(Kris) #3

When using the above code, the reported accuracy is 100%, because every example (eg) has eg['answer'] == 'accept'. So I added new_eg['answer'] = label above new_eg['label'] = label. With that change, training started, but it failed with this error:

File "cython_src/prodigy/models/textcat.pyx", line 221, in prodigy.models.textcat.TextClassifier.evaluate
ValueError: ('positive', 0.4924214482307434)

I have three labels I’m trying to train (positive, negative, neutral), but it looks like the model is only returning the prediction for one label and then failing. Any ideas how to return an output like {'positive': 0.49, 'negative': 0.21, 'neutral': 0.30}? Thanks for the help.

(Ines Montani) #4

Oh yes, this is a good point: you usually also want to add some negative examples if you’re running textcat.batch-train. This should be easy to do, since you can just take all options that weren’t selected.

Setting new_eg['answer'] = label can’t work, because the "answer" always needs to be one of "accept", "reject" or "ignore". But you can add logic that takes labels that you know are definitely wrong, and sets them to "answer": "reject".
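As a rough, untested sketch of that logic: the helper below is hypothetical (choice_to_textcat is not part of Prodigy) and works on plain dictionaries, so it can be tried without a database. It assumes each choice task carries an "options" list with an "id" per option, and that every option the annotator did not select is definitely wrong. That second assumption only holds if the classes are mutually exclusive or the annotator selected everything that applies.

```python
import copy


def choice_to_textcat(examples):
    """Turn choice tasks into one textcat-style task per option.

    Hypothetical helper: selected options become "accept" examples,
    all other options become "reject" examples.
    """
    converted = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored choice tasks
        selected = set(eg.get("accept", []))
        for option in eg.get("options", []):
            new_eg = copy.deepcopy(eg)
            # Drop the choice-specific fields from the copy.
            new_eg.pop("options", None)
            new_eg.pop("accept", None)
            new_eg["label"] = option["id"]
            # "answer" must stay one of "accept", "reject" or "ignore".
            new_eg["answer"] = "accept" if option["id"] in selected else "reject"
            converted.append(new_eg)
    return converted


# Made-up sample data, for illustration only.
sample = [
    {
        "text": "Great product, would buy again.",
        "options": [{"id": "positive"}, {"id": "negative"}, {"id": "neutral"}],
        "accept": ["positive"],
        "answer": "accept",
    }
]

for task in choice_to_textcat(sample):
    print(task["label"], task["answer"])
```

The sketch does not recompute hashes. For a real dataset you would probably still want to call set_hashes(new_eg, overwrite=True) on each converted example, as in the earlier snippet, before saving the result to a new dataset.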

Also make sure to set the --label on the command line when you’re training the model, so the labels are added correctly. For example: --label positive,negative,neutral.

(Kris) #5

Thank you for the help. When I run

prodigy textcat.batch-train my_training_data /tmp/model --eval-split 0.2 --label positive,negative,neutral

I received the following error:

usage: prodigy textcat.batch-train [-h] [-o None] [-la en] [-f 1] [-d 0.2]
                                   [-n 10] [-b 10] [-e None] [-es None] [-L]
                                   dataset [input_model]
prodigy textcat.batch-train: error: unrecognized arguments: --label positive,negative,neutral

Is there something I’m omitting that would allow this to work?

Without the --label flag, one of my labels has 99% confidence for every document I test after the model has been trained. Thanks again for the help.

(Ines Montani) #6

Oh, I’m sorry, this was my mistake! I forgot that the textcat.batch-train command will now read the labels off your data automatically, so there’s no need to specify the label.

What does your data look like? Do you have both accepted and rejected examples? What you describe here definitely indicates that the model has learned something like “everything is label X”.

(Kris) #7

Thank you so much for the help! I had made a mistake on my end which was causing all of the predictions to be the same. Once that was fixed, the model trained successfully and is performing great. Thanks again for the wonderful customer support.