Best practise for multi-label and textcat.teach

honnibal · November 18, 2017, 3:24pm

Glad it came at a good time for you

You’ve probably seen this, but in case you haven’t the insults classification tutorial is probably pretty relevant for you: https://www.youtube.com/watch?v=5di0KlKl0fE

I think you’ll do well creating terminology lists to bootstrap your categories. Start off with a couple of seed terms, and then build out the word list using the terms.teach recipe, as shown at the start of the insults classifier video. This will help you create initial models for each of your terms.

You want to get to the point where you have a single dataset with at least a few positive and negative examples for each of your labels. Prodigy assumes labels are not mutually exclusive, i.e. that each text can have multiple labels. If that’s not true for your domain, then you know that all examples that are positive for one class will be negative examples for the other classes. To take advantage of this knowledge, you can create ‘reject’ examples for the other classes once one class has been accepted. This logic is left for you to implement because label schemes can have complicated dependencies, e.g. some labels may be mutually exclusive, others not.

Overall I suggest you let your workflow evolve as you go. It’s a boot-strapping process: hopefully every bit of knowledge you’re adding can be used to make the knowledge collection easier. The optimal procedure for this will be different for every problem, so we’ve tried to give you a variety of tools that compose well.

You’ll find yourself moving text in and out of the database, merging records, etc. This is all by design. Similarly, you’ll want to write little bits of Python (or shell if you’re perverse enough to prefer it ) as you go. This is also by design. We wanted to avoid a problem we often find with developer tools, especially hosted ones: they often end up creating this parallel language of scripts and configurations, that’s actually just worse than Python. We assume you know at least one programming language pretty well, so we wanted to make sure we let you use it, instead of creating more arbitrary stuff.

List of built-in recipes: https://prodi.gy/docs/recipes
Text classification workflow: https://prodi.gy/docs/workflow-text-classification
Example training spaCy’s text classifier directly: https://spacy.io/usage/training#section-textcat
Also check out the custom recipes section of the docs. You can also view the source of the recipes within your Prodigy installation, as examples.

Topic		Replies	Views
textcat.batch-train question	7	495	November 28, 2022
Is textcat.teach (as out-of-the-box) appropriate with multilabel tasks? textcat , solved	4	337	June 28, 2022
textcat.teach for multi-class classification textcat	3	515	June 19, 2023
textcat.teach surprising UI for multilabel	4	234	December 15, 2022
Multi-label text classification with many labels usage , textcat	7	2402	June 30, 2020

Best practise for multi-label and textcat.teach

Related topics