I came across another problem that I'm not sure how to resolve. I'm using the match recipe with the choice view_id to classify paragraphs. I would like to use the dataset obtained this way in the textcat.batch-train recipe, so I can train a model to recognize the class of each paragraph.
I know that the dataset I currently have is not in the right format for textcat.batch-train. I was trying to customize the recipe, but I could not make it work.
Could you please advise me on what the best practice would be here?
Sure! This shouldn't actually be too difficult. The main difference of a "choice" dataset is that it has an "accept" property containing the IDs of the selected label(s). The textcat.batch-train recipe, on the other hand, expects one example per label.
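To make the difference concrete, here's a hedged sketch of the two formats as plain dicts. The text and label names are made up for illustration; your actual examples will also carry other properties like hashes and metadata:

```python
# One "choice" example, roughly as produced by an interface
# using the "choice" view_id:
choice_eg = {
    "text": "The service was fantastic.",
    "options": [
        {"id": "positive", "text": "positive"},
        {"id": "negative", "text": "negative"},
        {"id": "neutral", "text": "neutral"},
    ],
    "accept": ["positive"],  # IDs of the selected option(s)
    "answer": "accept",
}

# Roughly what textcat.batch-train expects instead: one example
# per label, with the label in a top-level "label" property.
textcat_eg = {
    "text": "The service was fantastic.",
    "label": "positive",
    "answer": "accept",
}
```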
So you could write a little script that loads the data, iterates over the examples and copies each one once for each selected label. I haven't tested this yet, but something along those lines should work:
from prodigy.components.db import connect
from prodigy.util import set_hashes
import copy

db = connect()
examples = db.get_dataset('your_dataset_name')
converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the example wasn't accepted
    labels = eg['accept']  # the selected label(s)
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # not 100% sure if necessary, but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)
When using the above code, the accuracy comes back as 100%, because every example (eg) had eg['answer'] == 'accept'. So I added new_eg['answer'] = label above new_eg['label'] = label. Training then started, but it failed with the error:
File "cython_src/prodigy/models/textcat.pyx", line 221, in prodigy.models.textcat.TextClassifier.evaluate
ValueError: ('positive', 0.4924214482307434)
I have three labels I'm trying to train (positive, negative, neutral), but it looks like the model is only returning the prediction for one label and then failing. Any ideas how to return an output like {'positive': .49, 'negative': .21, 'neutral': .30}? Thanks for the help.
Oh yes, this is a good point – you usually also want to add some negative examples if you're running textcat.batch-train. This should be easy to do, since you can just take all the options that weren't selected.
Setting "answer": label can't work, though, because the "answer" always needs to be one of "accept", "reject" or "ignore". But you can add logic that takes labels you know are definitely wrong and sets them to "answer": "reject".
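For example, the conversion loop above could be adapted along these lines. This is an untested sketch using plain dicts, with the set_hashes call left out so it runs without Prodigy installed – in your actual script you'd still want to re-hash each new example as before:

```python
import copy

def expand_choice_example(eg):
    """Turn one 'choice' example into per-label textcat examples:
    selected labels become "accept", unselected ones "reject"."""
    if eg.get('answer') != 'accept':
        return []  # skip examples that weren't accepted overall
    selected = set(eg.get('accept', []))
    all_labels = [opt['id'] for opt in eg.get('options', [])]
    converted = []
    for label in all_labels:
        new_eg = copy.deepcopy(eg)
        new_eg['label'] = label
        # "answer" must be one of "accept", "reject" or "ignore"
        new_eg['answer'] = 'accept' if label in selected else 'reject'
        converted.append(new_eg)
    return converted

eg = {
    'text': 'Great product!',
    'options': [{'id': 'positive'}, {'id': 'negative'}, {'id': 'neutral'}],
    'accept': ['positive'],
    'answer': 'accept',
}
for new_eg in expand_choice_example(eg):
    print(new_eg['label'], new_eg['answer'])
# positive accept
# negative reject
# neutral reject
```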
Also make sure to set the --label on the command line when you're training the model, so the labels are added correctly. For example: --label positive,negative,neutral.
Oh, I'm sorry, this was my mistake! I forgot that the textcat.batch-train command will now read the labels off your data automatically, so there's no need to specify the label.
What does your data look like? Do you have both accepted and rejected examples? What you're describing definitely indicates that the model has learned something like "everything is label X".
Thank you so much for the help! I had a user mistake on my end which was causing all of the predictions to be the same. Once fixed, the model ran successfully and is performing great. Thanks again for the wonderful customer support.