textcat.teach for multi-class classification

hi @riwikiwi!

Thanks for your question and welcome to the Prodigy community :wave:

textcat.teach assumes you have a model already: do you have one? how were the labels for that model created? why 100 classes?

One nice thing about textcat.teach is that it outputs binary annotations (as opposed to manual), which simply mean Yes/No decisions (instead of asking the user to choose from all possible labels). See Docs. So for that, an annotator will just say yes/no when presented with an example and a candidate label, reducing the cognitive load for an annotator to choose from 100+ classes at a time.

But I'm guessing you may not have a model yet, so you'd need to start with textcat.manual.

You'll save yourself a lot of headaches if you can find any way to simplify your problem, especially at first as you learn more about your data and create your first workflow:

Dealing with very large label sets or hierarchical labels

If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.

If your annotation scheme is mutually exclusive (that is, texts receive exactly one label), you’ll often want to organize your labels into a hierarchy, grouping similar labels together. For instance, let’s say you’re working on a chat bot that supports 200 different intents. Choosing between all 200 intents will be very difficult, so you should do a first pass where you annotate much more general categories. You’d then take all the texts annotated for some general type, such as information, and set up a new annotation task to sort them into more specific subtypes. This lets the annotators study up on that part of the annotation scheme, so they can make more reliable decisions.

Docs

If you set up like a hiearchical, we recommend this approach:

Also, rules (patterns) can be really helpful. This is especially the case if you have some prior knowledge about each group and know of some terms that could start. You can even use terms.teach to generate quickly.

Also: one outside-the-box idea -- if you know the name of all 100 class (this is usually the case in a business problem where you're told the classification types), consider using LLM terms.openai.fetch link to generate related terms (patterns) for each class. I don't know how much value zero-shot recipes would do for that many categories but you can test out if you install v1.12.

Alternatively, maybe for 20-30 classes you could try something like bulk labeling:

Unfortunately, there's not one perfect solution but hopefully, this gives you a few ideas of options you can experiment with. Hope this helps!