Workflow for textcat with imbalanced/skewed data

magdaaniol · February 25, 2022, 11:41am

Your proposed workflow does make sense with a couple of caveats:

Re step 3: There is currently no built-in command for checking that the training and evaluation datasets do not overlap. As of the next Prodigy release, the improved --eval-split functionality will take care of this. For now, though, you'd have to validate it yourself. Please check this thread, which discusses this very issue.
Re step 4: Whether downsampling the majority class (as you propose) is a good idea depends on how much data you have overall. If you don't have much, throwing examples away might be wasteful. You might want to look at other techniques as well, such as upsampling the minority class or a combination of the two. I'd recommend experimenting with a few techniques for dealing with imbalanced text datasets and seeing what works best in your scenario.
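
To make the two caveats above a bit more concrete, here is a minimal sketch in plain Python (it does not use any Prodigy or spaCy API). It assumes your examples are already loaded as dicts with a "text" and a "label" key, which is a simplified, hypothetical shape; adapt the keys to however your exported data actually looks. The helper names find_overlap and upsample_minority are made up for illustration. The first checks for texts that occur in both splits, the second does naive random upsampling so that every class matches the size of the largest one.

```python
import random
from collections import Counter, defaultdict


def find_overlap(train, dev):
    """Return the texts that appear in both splits, compared after
    lowercasing and collapsing whitespace."""
    def norm(example):
        return " ".join(example["text"].lower().split())

    train_texts = {norm(eg) for eg in train}
    return sorted({norm(eg) for eg in dev if norm(eg) in train_texts})


def upsample_minority(examples, seed=0):
    """Randomly duplicate examples of the smaller classes until every
    class has as many examples as the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for eg in examples:
        by_label[eg["label"]].append(eg)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data in the assumed {"text", "label"} shape.
train = [
    {"text": "Great product", "label": "POSITIVE"},
    {"text": "Works fine", "label": "POSITIVE"},
    {"text": "Love it", "label": "POSITIVE"},
    {"text": "Broke after a day", "label": "NEGATIVE"},
]
dev = [
    {"text": "great  product", "label": "POSITIVE"},
    {"text": "Never again", "label": "NEGATIVE"},
]

print(find_overlap(train, dev))
print(Counter(eg["label"] for eg in upsample_minority(train)))
```

If you go the resampling route, resample only the training split and do it after you have split off the evaluation data. Otherwise duplicated examples can end up on both sides and you are back to the overlap problem from step 3.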

And about the bonus question: Once you are done with the annotation and have exported your data using data-to-spacy, you no longer depend on Prodigy for the development of your spaCy project. Doing the conversion the other way round, i.e. from within spaCy, would make spaCy depend on Prodigy, which is something we really do not want to introduce.
Hope that helps!
