I created a set of rule-based classifiers with spaCy to predict the category of a given small block of text, and we then had an annotator accept or reject those predictions. I now have a Prodigy-annotated dataset, created via the mark recipe, with ~16k examples and 27 possible labels.
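For context, here is a minimal sketch of the kind of rule-based classifier I mean, using spaCy's PhraseMatcher (the patterns below are invented placeholders, not my real ones):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Invented example patterns -- the real lists are much longer
matcher.add("AUDI", [nlp.make_doc(t) for t in ["s3", "a4", "quattro"]])
matcher.add("BMW", [nlp.make_doc(t) for t in ["m3", "x5", "bimmer"]])
matcher.add("VOLVO", [nlp.make_doc(t) for t in ["s940 turbo", "xc90"]])

def predict_label(text):
    """Return the label of the first pattern match, or None."""
    doc = nlp.make_doc(text)
    matches = matcher(doc)
    if not matches:
        return None
    match_id, start, end = matches[0]
    return nlp.vocab.strings[match_id]

print(predict_label("I love my S3 it's so fast"))  # AUDI
```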
The JSONL looks something like this:
{"text": "I love my S3 it's so fast", "label": "AUDI", "meta": {"id": "27236"}, "answer": "accept"}
{"text": "Look out your window at that beautiful M3", "label": "BMW", "meta": {"id": "86544"}, "answer": "accept"}
{"text": "Nothing better than a day at the track", "label": "AUDI", "meta": {"id": "108341019"}, "answer": "ignore"}
{"text": "I think he's racing a S940 Turbo", "label": "VOLVO", "meta": {"id": "3464"}, "answer": "accept"}
{"text": "Is that a bird or a plane or is it a Tesla?", "label": "BMW", "meta": {"id": "75475454"}, "answer": "reject"}
My goal is a model that classifies a block of text by predicting its most likely label.
I've tried a lot of different variations of db-in and of loading models, and I've read both the spaCy and Prodigy docs, but I cannot figure out the proper workflow / data shapes.
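To show the kind of thing I've been attempting, here is a minimal sketch that converts the export into spaCy v3's DocBin training format, keeping only the accepted answers (paths and the label list are abbreviated placeholders). I suspect this is wrong on two counts: it throws away the rejects, and the one-hot cats claim the other 26 labels are definitely 0.0:

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
LABELS = ["AUDI", "BMW", "VOLVO"]  # ...all 27 labels in reality

db = DocBin()
with open("annotations.jsonl", encoding="utf8") as f:  # placeholder path
    for line in f:
        eg = json.loads(line)
        if eg["answer"] != "accept":
            continue  # drops rejects and ignores entirely
        doc = nlp.make_doc(eg["text"])
        # One-hot cats: the accepted label is 1.0, everything else 0.0
        doc.cats = {label: float(label == eg["label"]) for label in LABELS}
        db.add(doc)

db.to_disk("train.spacy")
```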
I think the challenge is that the annotations do not include an answer for every label on every example. E.g., in the 'Tesla' example above, each text was shown only once, with one candidate label, so the data has no idea whether that text should be `"label": "TESLA", "answer": "accept"` or not, but it definitely should not be treated as `"label": "TESLA", "answer": "reject"`.
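Put differently, the only honest per-example encoding I can come up with is a partial one, where the labels that were never shown stay missing rather than negative. A sketch (the function name is mine, and I'm assuming missing labels can be treated as "unknown" during training):

```python
def cats_from_annotation(eg):
    """Encode one binary annotation as a *partial* cats dict.

    Only the label that was actually shown gets a value; the other
    26 labels stay absent, meaning "unknown", not "rejected".
    """
    if eg["answer"] == "accept":
        return {eg["label"]: 1.0}
    if eg["answer"] == "reject":
        return {eg["label"]: 0.0}
    return None  # "ignore" -> no usable signal, skip the example

# The Tesla example above: all we actually know is that BMW is wrong.
eg = {"text": "Is that a bird or a plane or is it a Tesla?",
      "label": "BMW", "answer": "reject"}
print(cats_from_annotation(eg))  # {'BMW': 0.0}
```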
Thanks for any assistance.
EDIT: It seems like the better way to kick off the annotation task would have been to let the annotators correct the classifier prediction. Given that's not an option anymore, what should I do with my data?