Handling Textcat Imbalanced Data

I have used textcat to annotate my data, and the resulting dataset is imbalanced. Now I want to address this imbalance. Are there any good resources for doing so? Much appreciated.

{"text":"Professional Summary:","_input_hash":1038334841,"_task_hash":355466170,"options":[{"id":"PROFILES","text":"PROFILES"},{"id":"EXPERIENCES","text":"EXPERIENCES"},{"id":"ACADEMICS","text":"ACADEMICS"},{"id":"REWARDS","text":"REWARDS"}],"_view_id":"choice","config":{"choice_style":"single"},"accept":[],"answer":"ignore","_timestamp":1675933877}

{"text":"6+ years of robust and diverse experience in Manual and Automation testing with Windows and Web-based applications. Experience in Publishing & healthcare industry. ","_input_hash":-1341372683,"_task_hash":43111287,"options":[{"id":"PROFILES","text":"PROFILES"},{"id":"EXPERIENCES","text":"EXPERIENCES"},{"id":"ACADEMICS","text":"ACADEMICS"},{"id":"REWARDS","text":"REWARDS"}],"_view_id":"choice","accept":["EXPERIENCES"],"config":{"choice_style":"single"},"answer":"accept","_timestamp":1675933940}

{"text":"Core skills: Functionality testing, GUI testing, UAT, Integration testing, System testing, End to End testing, Smoke testing, Sanity testing, Data-Driven testing, Regression testing, Performance testing.","_input_hash":-1813196757,"_task_hash":-668551618,"options":[{"id":"PROFILES","text":"PROFILES"},{"id":"EXPERIENCES","text":"EXPERIENCES"},{"id":"ACADEMICS","text":"ACADEMICS"},{"id":"REWARDS","text":"REWARDS"}],"_view_id":"choice","accept":["EXPERIENCES"],"config":{"choice_style":"single"},"answer":"accept","_timestamp":1675934007}

This is a sample of the data I obtained from textcat. In this dataset there are many more EXPERIENCES examples than REWARDS, PROFILES, or ACADEMICS, which means the model will be biased towards EXPERIENCES. I don't want that, and I'd like to handle the imbalanced dataset so the model performs well. Suggestions and example links are highly appreciated. Thank you :slight_smile:
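For reference, the imbalance can be quantified by tallying the accepted labels straight from the exported JSONL. A minimal sketch, with two of the tasks above inlined (in practice you would read the lines from your exported dataset file):

```python
import json
from collections import Counter

# Two example tasks from this thread, as raw JSONL lines;
# in practice, iterate over the lines of the exported dataset file.
lines = [
    '{"text":"Professional Summary:","accept":[],"answer":"ignore"}',
    '{"text":"6+ years of experience ...","accept":["EXPERIENCES"],"answer":"accept"}',
]

counts = Counter()
for line in lines:
    task = json.loads(line)
    if task.get("answer") == "accept":  # skip ignored/rejected tasks
        counts.update(task.get("accept", []))

print(counts)
```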

Hi @kushal_pythonist & @kushalrsharma,

We have merged your posts into one thread as they are related.

To improve your model performance with respect to underrepresented categories you could either:

  1. curate your data by upsampling the examples from these categories in your corpus
    or
  2. optimize class weights in your training protocol.
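Route 1) can be as simple as sampling the rare labels with replacement until every class matches the most frequent one. A minimal sketch, assuming the annotations have been loaded as a list of (text, label) pairs (the function name and data shape are illustrative, not a Prodigy API):

```python
import random
from collections import defaultdict

def upsample(examples, seed=0):
    """Randomly duplicate examples of rare labels until every label
    has as many examples as the most frequent one.
    `examples` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # sample with replacement to fill the gap for rare labels
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced
```

Note that upsampling duplicates information rather than adding it, which is why collecting genuinely new examples (discussed below) is usually the better long-term fix.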

If you go with route 2) you'd probably need to train outside of Prodigy, as our train command is just a thin wrapper over spacy train. With spaCy, I believe you'd need a custom model implementation to set class weights; here's a related thread on the spaCy discussion board. Alternatively, you could try more general-purpose ML software such as scikit-learn, which has built-in utilities for setting class weights.
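For instance, most scikit-learn classifiers accept a `class_weight` argument, and `"balanced"` reweights each class inversely to its frequency, so misclassifying a rare class costs more. A rough sketch with a toy corpus (the texts and labels are just this thread's categories, not real training data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "6+ years of experience in manual and automation testing",
    "Core skills: functionality testing, regression testing",
    "Awarded employee of the quarter",
]
labels = ["EXPERIENCES", "EXPERIENCES", "REWARDS"]

# class_weight="balanced" scales each class's loss contribution by
# n_samples / (n_classes * count(class)), so the rare REWARDS class
# is penalized more heavily when misclassified.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(texts, labels)
```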

As far as the data curation route is concerned, there are several things you could try to collect more examples of the underrepresented classes:

  1. You could run a session of textcat.teach with your current model and some new data. In that workflow, Prodigy shows you the examples the model is unsure about, which will most likely be from your underperforming classes.

  2. You could also create a patterns file covering typical words for your underrepresented classes and bootstrap textcat.teach with it via the --patterns parameter. Additionally, you could augment the patterns with synonyms from a vector space model such as sense2vec to capture more variation. It would probably be worth playing with the vectors first to get a sense of whether they suit your domain, though. @Vincent has posted an excellent video on how to use sense2vec here.

  3. Finally, it's probably worth checking whether GPT-3 can find examples of your underrepresented classes for you! You could try prompting GPT-3 with some positive examples of your underrepresented classes using the Prodigy textcat.openai.correct recipe. We have provided detailed documentation on how to go about that here. You could then merge the dataset(s) created with textcat.openai.correct with your original dataset and see if that improves the scores.
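For the patterns route in point 2 above, a patterns file is plain JSONL of spaCy-style token match patterns, one per line. A minimal example for the rare classes here (the keywords are placeholders; you'd pick terms typical of your own data):

```json
{"label": "REWARDS", "pattern": [{"lower": "awarded"}]}
{"label": "REWARDS", "pattern": [{"lower": "award"}]}
{"label": "ACADEMICS", "pattern": [{"lower": "bachelor"}, {"lower": "of"}]}
{"label": "PROFILES", "pattern": [{"lower": "professional"}, {"lower": "summary"}]}
```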

One more note: data curation should be applied to the train set only. You can find some recommendations on how to split an imbalanced dataset into train, dev, and test sets here.
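A stratified split keeps the label proportions identical across partitions, so your (untouched) dev set still mirrors the real, imbalanced distribution. A sketch with scikit-learn on a toy corpus (the texts are placeholders):

```python
from sklearn.model_selection import train_test_split

# Toy corpus with two of the thread's labels; four examples per class
# so stratification has something to work with.
texts = [f"experience snippet {i}" for i in range(4)] + \
        [f"award snippet {i}" for i in range(4)]
labels = ["EXPERIENCES"] * 4 + ["REWARDS"] * 4

# stratify=labels preserves the label ratio in both splits, so the
# dev set reflects the true distribution even if you later upsample
# the train set.
train_texts, dev_texts, train_labels, dev_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)
```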

Let me know if you have any problems setting up any of these data curation workflows!