Help with postprocessing annotated data for training a multi-category text classification model

I created a set of rule-based classifiers with spaCy to predict the category of a given small block of text. We then had an annotator accept/reject those predictions. Now I have a Prodigy-annotated dataset with ~16k examples, created via the mark recipe. There are 27 possible labels.

The JSONL looks something like this:

{"text": "I love my S3 it's so fast", "label": "AUDI", "meta": {"id": "27236"}, "answer": "accept"}
{"text": "Look out your window at that beautiful M3", "label": "BMW", "meta": {"id": "86544"}, "answer": "accept"}
{"text": "Nothing better than a day at the track", "label": "AUDI", "meta": {"id": "108341019"}, "answer": "ignore"}
{"text": "I think he's racing a S940 Turbo", "label": "VOLVO", "meta": {"id": "3464"}, "answer": "accept"}
{"text": "Is that a bird or a plane or is it a Tesla?", "label": "BMW", "meta": {"id": "75475454"}, "answer": "reject"}

My goal is to have a model that classifies text, predicting the most likely label.

I've tried a lot of different variations of db-in, loading models, and reading both the spaCy and Prodigy docs, but I can't figure out the proper workflow / data shapes.

I think the challenge is that the annotations don't include an answer for every label on every example. Using the 'Tesla' example above: each example was only shown once, with a single predicted label, so the data can't tell us whether that text should be "label": "TESLA", "answer": "accept" or not - but it definitely should not be treated as "label": "TESLA", "answer": "reject".

Thanks for any assistance.

EDIT: It seems like this would have been the better way to kick off the annotation task: allow the annotators to correct the classifier's prediction. Given that's not an option anymore, what should I do with my data?

Hi! Just to make sure I understand your question correctly: you have an annotated dataset and your categories are mutually exclusive (or not?). And you only have sparse annotations, so you might know that one text isn't about Tesla or BMW, but you don't know what other label(s) it's about?

In general, it's no problem to update the text classifier with incomplete information, and if you use the train command, Prodigy should merge the data and handle it accordingly. The dataset you've collected with mark should already be in the correct format ("text" plus "label" plus "answer"), so you should be able to run training experiments on it directly.
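For example, something along these lines (the exact recipe arguments differ between Prodigy versions, and the dataset and model names here are just placeholders):

```
prodigy train textcat your_dataset en_core_web_sm --output ./textcat_model --eval-split 0.2
```

This trains a text classifier from the annotations in your_dataset and holds back 20% of the examples for evaluation.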

Under the hood, you may only be updating with categories like {"TESLA": 0.0, "BMW": 0.0} for some examples, which may be less effective than updating with values for all categories, including the ones that apply. But it still moves the model in the right direction.
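To make this concrete, here's a minimal sketch (not Prodigy's actual internals, just an illustration of the idea) of how a single mark-style annotation translates into a partial categories dict:

```python
def annotation_to_cats(example):
    # Map one accepted/rejected annotation to a partial "cats" dict.
    # An accept is a positive example for that one label, a reject a
    # negative one. The other 26 labels are simply left out, because
    # the annotation says nothing about them.
    if example["answer"] == "accept":
        return {example["label"]: 1.0}
    if example["answer"] == "reject":
        return {example["label"]: 0.0}
    return None  # "ignore" answers carry no training signal

print(annotation_to_cats(
    {"text": "Is that a bird or a plane or is it a Tesla?",
     "label": "BMW", "answer": "reject"}
))  # {'BMW': 0.0}
```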

Thank you for your helpful answer! The problem I was stuck on (training textcat would only use "accept" answers) seems to have been fixed by the new, slightly clearer Prodigy API. Took the dive and updated, and the model seems to be training.

One related follow-up question: is there a way to do a stratified train/test split? Or is the preferred method to move over to spaCy at that point?

Glad it's working!

Once you get to a point where you want to be more specific about how you sample your training and test data (and you're not just running quick experiments to see if you're on the right track), you might want to do that in a separate step, yes.

Prodigy's default --eval-split setting on the train recipes will just hold back a given percentage of the (shuffled) training examples. That's also how the data-to-spacy recipe does it if you define a split. The --eval-id option on the train recipe lets you pass in the name of a Prodigy dataset that should be used for evaluation. So in theory, you could also use that to provide your own custom evaluation set.
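If you do want a stratified split, one option is to do it yourself in Python, save the two halves back as separate Prodigy datasets, and then point --eval-id at the evaluation set. A rough sketch using Prodigy's database API and scikit-learn (the dataset names are placeholders):

```python
from prodigy.components.db import connect
from sklearn.model_selection import train_test_split

db = connect()
examples = db.get_dataset("car_brands")  # your annotated dataset

# "ignore" answers carry no signal, so leave them out of both splits
examples = [eg for eg in examples if eg["answer"] != "ignore"]

# Stratify on the label/answer combination so that accepts and
# rejects for each of the 27 labels are proportionally represented
strata = [eg["label"] + "_" + eg["answer"] for eg in examples]
train, dev = train_test_split(
    examples, test_size=0.2, stratify=strata, random_state=0
)

for name, split in [("car_brands_train", train), ("car_brands_eval", dev)]:
    db.add_dataset(name)
    db.add_examples(split, datasets=[name])
```

After that, you'd train with something like prodigy train textcat car_brands_train en_core_web_sm --eval-id car_brands_eval. (One caveat: stratification needs at least two examples per stratum, so very rare label/answer combinations may need to be grouped or dropped.)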
