I'm trying to build a classifier for food recipe categories. I want to make sure I understand the process before I begin categorizing all of the recipes.
Ultimately, what I want to end up with is a model that I can push new recipes through, and it will then take a stab at categorizing them.
I have a corpus of recipes that contains the recipe title, description and list of ingredients. Each recipe will have one category. The corpus (recipes.jsonl) represents all of the recipe data I have.
Ideally, I'd like to take multiple passes over this list, each time looking for only one category so that during classification it's a binary decision (i.e., DESSERT or not DESSERT). I think that would be easier than having to pick from one of 15 different categories for each recipe. However, I'm open to suggestions.
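For reference, here's roughly how I'm planning to read the corpus and build the text I'd annotate – the field names are just what my JSONL happens to use:

```python
import json

# Read the corpus and combine title, description and ingredients into one
# text per recipe for annotation. The field names here are just what my
# JSONL happens to use.
examples = []
with open("recipes.jsonl", encoding="utf8") as f:
    for line in f:
        recipe = json.loads(line)
        text = "\n".join(
            [recipe["title"], recipe["description"], ", ".join(recipe["ingredients"])]
        )
        examples.append({"text": text})
```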
Hi! I think this is a very good point, and your approach sounds reasonable to me.
Yes, I'd say that's a good next step. Prodigy lets you train and merge annotations from multiple datasets, so if you train a single text classifier from your DESSERT and POULTRY datasets, all annotations on the same example will be merged into one, and the information the model is updated with will be something like {"DESSERT": True, "POULTRY": False} etc. This makes it very easy to break a larger classification task down into per-label datasets and still train one combined model at the end.
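Just to make the idea more concrete, here's a tiny sketch of what that merging conceptually amounts to – it's not Prodigy's actual implementation, and the example texts are made up, but it shows how per-label accept/reject answers end up as one dict of labels per example:

```python
from collections import defaultdict

# Toy sketch of the merging logic (not Prodigy's actual implementation):
# each per-label dataset contains binary accept/reject decisions, keyed
# here by the example text for simplicity.
dessert_annotations = [
    {"text": "Chocolate mousse ...", "label": "DESSERT", "answer": "accept"},
    {"text": "Roast chicken ...", "label": "DESSERT", "answer": "reject"},
]
poultry_annotations = [
    {"text": "Chocolate mousse ...", "label": "POULTRY", "answer": "reject"},
    {"text": "Roast chicken ...", "label": "POULTRY", "answer": "accept"},
]

merged = defaultdict(dict)
for annot in dessert_annotations + poultry_annotations:
    merged[annot["text"]][annot["label"]] = annot["answer"] == "accept"

# merged["Chocolate mousse ..."] == {"DESSERT": True, "POULTRY": False}
```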
When you train with Prodigy (or export your annotations for spaCy) and don't specify a dedicated evaluation set, Prodigy will hold back a certain percentage of examples for evaluation. You can control this by setting the --eval-split argument.
If you already know which examples you want to use for training and development, you can do the split upfront and create separate datasets for training and evaluation. Once you're getting serious about evaluation, it's definitely a good idea to have dedicated evaluation sets that don't change – this way, you're always evaluating on the same data, you'll be able to compare results between training runs in a meaningful way, and you can check whether your model improves as you annotate more data.
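If you do want to split upfront, a quick script along these lines would do it – the file names and the 80/20 ratio are just placeholders for your own setup:

```python
import json
import random

# Do the train/evaluation split up front so the evaluation set stays the
# same across training runs. File names and the 80/20 split are just
# placeholders.
random.seed(0)
with open("recipes.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
split = int(len(examples) * 0.8)
train, dev = examples[:split], examples[split:]

for path, rows in [("recipes_train.jsonl", train), ("recipes_eval.jsonl", dev)]:
    with open(path, "w", encoding="utf8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")
```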
You also want to make sure that your evaluation set is representative of the data your model will see at runtime and contains examples of all the labels you're predicting. Otherwise, your evaluation results may not actually reflect how well your model really works.
So assuming your evaluation set is large and representative enough, the overall accuracy should give you a good estimate of how well the classifier works. You can also look at the per-label scores to find potential problems. Maybe there's one particular label that scores worse than all others, which could indicate that you need more examples of that label, or that you have inconsistencies in the data (e.g. if the label is ambiguous).
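If you want to dig into the per-label scores yourself, a rough sketch like this works with any trained spaCy text classifier – it assumes you've exported the pipeline to a directory and that each line of your evaluation file stores a gold "label", so adjust it to however you keep your data:

```python
import json
from collections import Counter

import spacy

# Rough per-label check with a trained spaCy text classifier. Assumes an
# exported pipeline directory and an evaluation file where every line has
# a "text" and a gold "label".
nlp = spacy.load("./textcat_model")
with open("recipes_eval.jsonl", encoding="utf8") as f:
    eval_examples = [json.loads(line) for line in f]

correct, total = Counter(), Counter()
for example in eval_examples:
    doc = nlp(example["text"])
    predicted = max(doc.cats, key=doc.cats.get)  # highest-scoring label
    total[example["label"]] += 1
    correct[example["label"]] += int(predicted == example["label"])

for label in total:
    print(label, f"{correct[label] / total[label]:.2%} ({total[label]} examples)")
```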
Btw, Prodigy v1.11 introduced a new textcat.correct workflow that might also be useful once you have an initial trained text classifier. This will show you all labels, with the model's confident predictions pre-selected.
> Once you're getting serious about evaluation, it's definitely a good idea to have dedicated evaluation sets that don't change – this way, you're always evaluating on the same data, you'll be able to compare results between training runs in a meaningful way, and you can check whether your model improves as you annotate more data.
That's a great idea. I'm not sure if this is a dumb question but is there a rule of thumb on the size of that evaluation set?
This is definitely a good question, and it really depends on the data and the label distribution – if you have lots of labels, including some that are rare, you usually want a larger evaluation set to make sure all labels are covered. If the set is too small, your results also become harder to interpret: with only a small number of examples, even one or two individual predictions can account for a few percent of difference in accuracy (with 50 examples, a single prediction is already 2% of the total).
In the beginning, aiming for an evaluation set of about the same size as your training set might be a good approach – so you could train on 300 examples and evaluate on 300, and grow both sets from there. Once you're satisfied with your evaluation set (say, at 800 examples), you can keep it stable and train on 800, 1,000, 1,200 examples while evaluating on the same 800 evaluation examples.