Handling Textcat Imbalanced Data

Hi @kushal_pythonist & @kushalrsharma,

We have merged your posts into one thread as they are related.

To improve your model performance with respect to underrepresented categories you could either:

  1. curate your data by upsampling the examples from these categories in your corpus
    or
  2. optimize class weights in your training protocol.

If you go with route 2, you'd probably need to train outside of Prodigy, as our train command is just a thin wrapper around spacy train. With spaCy, I believe you'd need a custom model implementation to set class weights. Here's a related thread on the spaCy discussion board. Alternatively, you could try a more general-purpose ML library such as scikit-learn, which has built-in utilities for setting class weights.
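To give a rough idea of what class weights look like in practice, here is a minimal sketch of the "balanced" heuristic, where each class is weighted by n_samples / (n_classes * class_count). The labels are made-up toy data and `balanced_class_weights` is a hypothetical helper name, not part of Prodigy or spaCy.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), the same heuristic
    scikit-learn applies for class_weight="balanced".
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy, made-up label distribution: 8 common vs. 2 rare examples.
labels = ["COMMON"] * 8 + ["RARE"] * 2
weights = balanced_class_weights(labels)
print(weights)

# scikit-learn classifiers that support class weights accept a dict like this,
# e.g. LogisticRegression(class_weight=weights), or class_weight="balanced"
# to have the weights computed for you.
```

The rare class ends up with a larger weight, so mistakes on it cost more during training.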

As far as the data curation route is concerned, there are several things you could try to collect more examples of the underrepresented classes:

  1. You could run a session of textcat.teach with your current model and some new data. In that workflow, Prodigy shows you the examples the model is most unsure about, which will most likely come from your underperforming classes.

  2. You could also create a patterns file covering typical words for your underrepresented classes and bootstrap textcat.teach with it by passing the file via the --patterns argument. Additionally, you could augment the patterns with synonyms from a vector space model such as sense2vec to capture more variation. It would probably be worth playing with the vectors first to get a sense of whether they are suited to your domain, though. @Vincent has posted an excellent video on how to use sense2vec here.

  3. Finally, it's probably worth checking whether GPT-3 can find examples of the underrepresented classes for you! You could try prompting GPT-3 with some positive examples of your underrepresented classes using Prodigy's textcat.openai.correct recipe. We've provided detailed documentation on how to go about that here. You could then merge the dataset(s) created with textcat.openai.correct with your original dataset and see if that improves the scores.
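For the patterns file mentioned in option 2, here is a minimal sketch that writes one JSONL entry per seed term. The label `REFUND`, the seed terms, the file name and the `build_patterns` helper are all hypothetical placeholders, and splitting on whitespace is only an approximation of how spaCy tokenizes text.

```python
import json


def build_patterns(label, terms):
    """Turn seed terms into token-based match patterns for one label."""
    patterns = []
    for term in terms:
        # One token dict per whitespace-separated word, matched on lowercase form.
        token_pattern = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": token_pattern})
    return patterns


# Hypothetical label and seed terms; replace with words typical of your rare class.
patterns = build_patterns("REFUND", ["refund", "money back", "chargeback"])

# Patterns files are JSONL: one JSON object per line.
with open("patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")

# The file could then be passed to the recipe, roughly like this
# (dataset, model and source names are placeholders):
# prodigy textcat.teach my_dataset my_model ./new_data.jsonl --label REFUND --patterns ./patterns.jsonl
```

Terms suggested by sense2vec could be appended to the seed list before writing the file.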

Finally, I'd say that data curation should be applied to the training set only, so that the dev and test sets still reflect the real class distribution. You can find some recommendations on how to split into train, dev and test with an imbalanced dataset here.
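As a rough illustration of that order of operations, the sketch below first makes a stratified train/dev split and only then upsamples the minority class within the training portion. It assumes single-label examples stored as dicts with "text" and "label" keys; the data, the function names and the 20% dev fraction are made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(examples, dev_fraction=0.2, seed=0):
    """Split so each label keeps roughly the same share in train and dev."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for eg in examples:
        by_label[eg["label"]].append(eg)
    train, dev = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_dev = max(1, round(len(group) * dev_fraction))
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev


def upsample(train, seed=0):
    """Resample minority classes with replacement up to the majority count."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for eg in train:
        by_label[eg["label"]].append(eg)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing number of examples at random, with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy, made-up data: 20 common vs. 5 rare examples.
examples = (
    [{"text": f"common {i}", "label": "COMMON"} for i in range(20)]
    + [{"text": f"rare {i}", "label": "RARE"} for i in range(5)]
)
train, dev = stratified_split(examples)
train_balanced = upsample(train)  # dev is left untouched
```

Because the split happens before upsampling, no duplicated example can leak from the training set into the dev set.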

Let me know if you have any problems setting up any of these data curation workflows!