Best practice for multi-label and textcat.teach

:tada:
Glad it came at a good time for you :slight_smile:

You’ve probably seen this already, but in case you haven’t, the insults classification tutorial is pretty relevant for you: https://www.youtube.com/watch?v=5di0KlKl0fE

I think you’ll do well creating terminology lists to bootstrap your categories. Start off with a couple of seed terms for each category, and then build out the word list using the terms.teach recipe, as shown at the start of the insults classifier video. This will help you create initial models for each of your categories.

You want to get to the point where you have a single dataset with at least a few positive and negative examples for each of your labels. Prodigy assumes labels are not mutually exclusive, i.e. that each text can have multiple labels. If that’s not true for your domain, then you know that all examples that are positive for one class will be negative examples for the other classes. To take advantage of this knowledge, you can create ‘reject’ examples for the other classes once one class has been accepted. This logic is left for you to implement because label schemes can have complicated dependencies, e.g. some labels may be mutually exclusive, others not.
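To make the mutually exclusive case concrete, here is a minimal sketch in plain Python. It assumes your annotations are dicts with "text", "label" and "answer" keys, where "answer" is "accept" or "reject". The function name and the example labels are made up for illustration, so adapt them to your own scheme.

```python
def expand_exclusive(examples, labels):
    """Add 'reject' examples for all other labels whenever one label is accepted.

    Only valid if the labels really are mutually exclusive.
    """
    expanded = []
    for eg in examples:
        expanded.append(eg)
        # A rejected example tells us nothing about the other classes.
        if eg.get("answer") != "accept":
            continue
        for other in labels:
            if other != eg["label"]:
                expanded.append(
                    {"text": eg["text"], "label": other, "answer": "reject"}
                )
    return expanded


# Hypothetical labels and annotations, just to show the shape of the data.
LABELS = ["INSULT", "SPAM", "OTHER"]
annotations = [
    {"text": "you absolute muppet", "label": "INSULT", "answer": "accept"},
    {"text": "nice weather today", "label": "SPAM", "answer": "reject"},
]
expanded = expand_exclusive(annotations, LABELS)
```

If only some of your labels exclude each other, you could pass just that subset as `labels` and leave the rest alone.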

Overall I suggest you let your workflow evolve as you go. It’s a boot-strapping process: hopefully every bit of knowledge you’re adding can be used to make the knowledge collection easier. The optimal procedure for this will be different for every problem, so we’ve tried to give you a variety of tools that compose well.

You’ll find yourself moving text in and out of the database, merging records, etc. This is all by design. Similarly, you’ll want to write little bits of Python (or shell if you’re perverse enough to prefer it :wink: ) as you go. This is also by design. We wanted to avoid a problem we often find with developer tools, especially hosted ones: they often end up creating a parallel language of scripts and configurations that’s actually just worse than Python. We assume you know at least one programming language pretty well, so we wanted to make sure we let you use it, instead of creating more arbitrary stuff.
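As an example of the kind of little script I mean, here is a rough sketch for merging two sets of exported annotations. It assumes you have exported the records as JSONL (one JSON object per line) and that each record has "text", "label" and "answer" keys. The helper names are hypothetical, and "the later record wins" is just one possible merge policy.

```python
import json


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def merge_records(*record_lists):
    """Merge annotation lists, keeping one record per (text, label) pair.

    If the same pair appears more than once, the last one seen wins.
    """
    merged = {}
    for records in record_lists:
        for eg in records:
            merged[(eg["text"], eg.get("label"))] = eg
    return list(merged.values())


# In-memory demo with made-up records; with files you would call
# read_jsonl() on each export and write_jsonl() on the result.
old = [{"text": "a", "label": "INSULT", "answer": "reject"}]
new = [
    {"text": "a", "label": "INSULT", "answer": "accept"},
    {"text": "b", "label": "INSULT", "answer": "reject"},
]
merged = merge_records(old, new)
```

You can then load the merged file back into a fresh dataset and train from that.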
