Workflow for textcat with imbalanced/skewed data

I have a textcat-multilabel dataset that I need to train from. I want to freeze some of the data for testing while the rest is used for training and evaluation. My data is very skewed: roughly 5% POSITIVE vs 95% NEGATIVE. Does the following make sense?

First, split the dataset into a training and a testing dataset, e.g. with an eval split of 20%. From then on, the testing dataset is never used for training. Then each time I have to retrain, I run the following steps:

  1. Run data-to-spacy on training with a 20% eval split to get train.spacy and dev.spacy (in case new annotations were added).
  2. Run data-to-spacy on testing with a 0% eval split to get test.spacy (in case new annotations were added).
  3. Validate that there are no overlaps between the three datasets. Is there a built-in command for this?
  4. Pre-balance the training dataset by discarding a lot of NEGATIVE examples. Is this recommended?
  5. Train the model on train.spacy with dev.spacy as evaluation data.
  6. Run the final evaluation on test.spacy.
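As a side note, the overlap check in step 3 could be sketched in plain Python over the raw texts. This is a hypothetical helper, not a built-in command; the commented-out lines show how the texts could be loaded from the exported .spacy files with spaCy's DocBin, assuming spaCy is installed:

```python
from itertools import combinations

def find_overlaps(splits):
    """Given {split_name: iterable of texts}, return every pair of
    splits that shares at least one identical text."""
    sets = {name: set(texts) for name, texts in splits.items()}
    return {
        (a, b): sets[a] & sets[b]
        for a, b in combinations(sets, 2)
        if sets[a] & sets[b]
    }

# Loading texts from an exported corpus (requires spaCy):
#   import spacy
#   from spacy.tokens import DocBin
#   nlp = spacy.blank("en")
#   texts = [doc.text for doc in
#            DocBin().from_disk("train.spacy").get_docs(nlp.vocab)]

overlaps = find_overlaps({
    "train": ["a", "b", "c"],
    "dev": ["d", "e"],
    "test": ["c", "f"],  # "c" leaked from train
})
print(overlaps)  # {('train', 'test'): {'c'}}
```

Comparing exact texts is a simple proxy; if your examples can differ in whitespace or casing, you may want to normalize before hashing.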

I'm planning to create this as a spacy project.
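The initial train/test freeze described above could be sketched as a stratified split, so that the rare POSITIVE label is represented in both parts despite the 5/95 skew. This is a hypothetical plain-Python helper operating on (text, is_positive) pairs, not a Prodigy command:

```python
import random

def stratified_split(examples, eval_frac=0.2, seed=0):
    """Split (text, is_positive) pairs into train/test while keeping
    the POSITIVE/NEGATIVE ratio roughly equal in both parts."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex[1]]
    neg = [ex for ex in examples if not ex[1]]
    train, test = [], []
    for group in (pos, neg):
        group = group[:]          # don't mutate the caller's data
        rng.shuffle(group)
        cut = int(len(group) * eval_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

# ~5% positive, like the dataset described above
data = [(f"doc{i}", i % 20 == 0) for i in range(100)]
train, test = stratified_split(data)
print(len(train), len(test))  # 80 20
```

A fixed seed keeps the frozen test set reproducible across runs.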

Bonus question

Is data-to-spacy not available in spaCy itself? Preferably, my spaCy project shouldn't depend on Prodigy (which would also pin the spaCy version) just to be able to run data-to-spacy.

Do I need to provide more information?

Hi @NIX411,

Your proposed workflow does make sense with a couple of caveats:

  • re step 3: Currently there is no separate built-in command for validating that there is no overlap between the datasets. As of the next Prodigy release, the improved --eval-split functionality will take care of this. For now, though, you'd have to validate it yourself. Please check this thread, which discusses this very issue.
  • re step 4: Whether downsampling the majority class (as you propose) is a good idea depends on how much data you have overall. If it's not too much, it might be considered wasteful. You might also want to look at other techniques, such as upsampling the minority class or a combination of both. It is recommended to experiment with some of these techniques for dealing with imbalanced text datasets and see what works best in your scenario.
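A random-downsampling pass over the training examples could look like the sketch below. The (text, label) format and the keep_ratio parameter are illustrative assumptions; the right ratio is something to tune experimentally, as discussed above:

```python
import random

def downsample_negatives(examples, keep_ratio=2.0, seed=0):
    """Keep all POSITIVE examples and at most keep_ratio * n_positive
    randomly chosen NEGATIVE examples from (text, label) pairs."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex[1] == "POSITIVE"]
    neg = [ex for ex in examples if ex[1] == "NEGATIVE"]
    rng.shuffle(neg)
    kept = pos + neg[: int(len(pos) * keep_ratio)]
    rng.shuffle(kept)  # avoid training on a label-sorted stream
    return kept

# 5 positives + 95 negatives, like the 5/95 skew described above
examples = [("p", "POSITIVE")] * 5 + [("n", "NEGATIVE")] * 95
balanced = downsample_negatives(examples)
print(len(balanced))  # 5 positives + 10 negatives = 15
```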

And about the bonus question: once you are done with the annotation and have exported your data using data-to-spacy, you no longer depend on Prodigy for the development of your spaCy project. Providing the conversion the other way round, i.e. from within spaCy, would make spaCy depend on Prodigy, which is something we really don't want to introduce.
Hope that helps!


Thanks @magdaaniol. That's helpful.

Quick follow-up on step 4. Which dataset would you balance? I assume train.spacy, but what about dev.spacy? I assume you want to keep test.spacy unbalanced, of course.

Hi @nix411:
You're right in your assumption that test.spacy should be left as is, i.e. reflecting the real-world distribution of classes. You want to evaluate on a sample that is as close to the expected production setting as possible. The same applies to dev.spacy: the development set is a proxy for the test set that helps you make informed decisions, so you want it to be as close to the real-world distribution as possible too. To answer your question: all data-related modifications should be done on train.spacy only.
