Text classification with multiple exclusive labels and unbalanced classes

I am attempting to use textcat to identify sentences from recipe articles which can be labelled in one of six different classes. These are ingredients, instructions, timings, servings, recipe titles or ingredients list headers (e.g. "For the pastry:" preceding a list of ingredients). The first two labels are far more common than the others, partly as there are far more sentences per recipe which fall into these categories, and partly because not all recipes contain the less common categories at all.

Based on the textcat page in the Prodigy docs, I started by looking at the textcat.teach recipe and seeding each of the categories, but I found this quite slow with even just two labels, so it didn't seem workable for six. I then moved to textcat.manual, but I think this is suffering from the lack of examples for some of the labels: the scores for these vary quite a bit each time I train, and tend to sit between 0.4 and 0.6 for "timings" and "servings". These two classes should in principle be easy to identify, as the sentences generally have similar content, e.g. "Serves 2/Makes 4" and "Cook 20 min. / Prep 10 min."

So now I am wondering what the best approach for this scenario would be. Should I try and construct a more balanced dataset for the manual training by pulling sentences out of my articles which contain e.g. "Serves/Makes/Cook/Prep" and continue with the same model, or do I really need to use teach here? I was also unsure if there is a way to use different methods for different labels with one resultant model? Or is there another way to deal with this kind of scenario which I haven't thought of?
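For illustration, the keyword-based pre-selection described above could be sketched roughly as follows. This is only a sketch under assumptions: the patterns, the `RARE_PATTERNS` name and the `select_candidates` helper are hypothetical and not part of Prodigy, and the sentences are assumed to be already split out of the articles.

```python
import re

# Hypothetical keyword patterns for the two rare classes; tune these to your data.
RARE_PATTERNS = {
    "servings": re.compile(r"\b(serves|makes)\b", re.IGNORECASE),
    "timings": re.compile(r"\b(cook|prep)\b.*\d", re.IGNORECASE),
}


def select_candidates(sentences):
    """Group sentences by the first rare-class pattern they match."""
    candidates = {label: [] for label in RARE_PATTERNS}
    for sentence in sentences:
        for label, pattern in RARE_PATTERNS.items():
            if pattern.search(sentence):
                candidates[label].append(sentence)
                break  # assign each sentence to at most one candidate pool
    return candidates


sentences = [
    "Serves 2",
    "Cook 20 min. / Prep 10 min.",
    "Chop the onions finely.",
    "200g plain flour",
]
candidates = select_candidates(sentences)
```

The matches are only candidates for annotation, not labels: a sentence such as "Cook the onions for 5 minutes" would match the timings pattern even though it is an instruction, so the selected sentences would still need to be annotated by hand.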

Additionally, in my textcat.manual command I had set --exclusive, as a sentence cannot belong to more than one class. However, when I added the --textcat-exclusive flag to the subsequent train command, it gave zero F-scores for every label except ingredient and instruction. Should I be using either or both of these flags in my commands?

Any input would be very helpful!

Cheers,
Zara

I think for your problem, you can probably use the document structure a bit more to help you bootstrap the annotation. For instance, the recipe title is always going to be very near the start of the document. It doesn't really make sense to look at every sentence of the document and ask "Is this the title?". Similarly, you might have constraints like only one ingredient header per document, and the ingredient header always has to come before the ingredients.
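As a rough sketch of what using the document structure might look like, the snippet below narrows down which sentences are even worth asking about. Everything here is an assumption made for illustration: the function names, the `max_position` cutoff, and the `is_ingredient` flags (imagined as coming from an earlier pass over the document) are hypothetical.

```python
def title_candidates(sentences, max_position=3):
    """Only sentences near the start of the document can be the title."""
    return list(enumerate(sentences[:max_position]))


def header_candidates(sentences, is_ingredient):
    """Indices of non-ingredient sentences directly followed by an ingredient."""
    return [
        i
        for i in range(len(sentences) - 1)
        if not is_ingredient[i] and is_ingredient[i + 1]
    ]


sentences = [
    "Apple pie",
    "Serves 4",
    "For the pastry:",
    "200g plain flour",
    "100g butter",
    "Rub the butter into the flour.",
]
# Hypothetical flags, e.g. from a first-pass ingredient classifier.
is_ingredient = [False, False, False, True, True, False]

titles = title_candidates(sentences, max_position=2)
headers = header_candidates(sentences, is_ingredient)
```

With constraints like these, the annotation question for a rare label is only asked about a handful of sentences per document rather than all of them.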

If you're ingesting articles from the web, and you have a lot of articles from the same source, you might also be able to use the HTML structure for that source to give you a lot of the information.

Machine learning projects are often a balancing act between using totally generic techniques, so that you don't have to think about the specifics of your problem, and exploiting data- and problem-specific details that make the problem much easier. The generic approach moves more of the complexity onto the model, so the modelling has to do more work. But sometimes it's better to do a little more specific processing, which can make the model's job much easier.

Thanks for your quick response!

Actually, this project was attempted before using the HTML tags of the articles exclusively, but due to the wide variety of content and formatting used in the articles, it required a lot of human curation afterwards. So this was an attempt to create an alternative which would be less affected by changes in style and formatting within the articles. Perhaps I could initially pull out information using tags and verify this with a text classification task...

I should explain that the recipe title is of particular interest in this case for articles containing multiple recipes, where it can be hard to distinguish where a new recipe begins (rather than there just being a headed paragraph or ingredient list title). Similarly for the ingredient list titles, these don't always occur in the same location or with the same tags, and recipes will have varying numbers (e.g. one might have "For the salad", "For the dressing", "To serve" and another may have none).

Would you be able to clarify the use of the exclusive flag, though, please? That is, should it apply to this sort of classification, and does the flag for manual correspond to the flag for training?

Cheers,
Zara

Sure thing. Basically there are some label systems where you have a lot of categories that might apply simultaneously. For instance, you might be tagging the recipes for whether they're quick to make, and whether they're healthy, and whether they're vegetarian. Obviously a recipe can be all three, so you don't want to have an exclusive category system for that. You don't want the model to learn that saying a recipe is healthy means it isn't vegetarian.

I do think in your case your labels will be mutually exclusive, so you probably want the exclusive setting. But it comes down to the label scheme and how you approach the problem. What you should avoid, though, is a label scheme where some labels are mutually exclusive and others aren't.
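To make the distinction concrete, here is a small illustration of the general idea, not the actual spaCy or Prodigy implementation. With exclusive classes the scores are typically normalised together (softmax), so raising one label's probability lowers the others; with non-exclusive labels each score is squashed independently (sigmoid), so several labels can be likely at once. The raw scores below are made up.

```python
import math


def softmax(scores):
    """Normalise scores jointly so they sum to 1 (exclusive classes)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(scores):
    """Score each label independently (non-exclusive labels)."""
    return [1 / (1 + math.exp(-s)) for s in scores]


raw = [2.0, 1.5, -1.0]  # made-up raw scores for three labels
exclusive = softmax(raw)
independent = sigmoid(raw)
```

Here the `exclusive` probabilities sum to 1, whereas the `independent` ones do not have to: the first two labels both come out well above 0.5.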