Robustness to errors in labeled data

Hello team,

I have a general question about the spaCy textcat model. Is there any study that shows how robust the model is to errors in the training data (i.e., data that are labeled incorrectly)?
In other words, is it better to annotate more data with some label errors, or to annotate less data with higher label accuracy?

Thanks,
Ati

There's no study for that, unfortunately. It will depend on the specifics of the problem.

You could simulate this: introduce noise by flipping some labels, and simulate having less data by removing some examples. This should allow you to plot out the trade-offs for the problems you care about. If you do this, I do hope you share the results!
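For example, here's a minimal sketch of how you might set up such an experiment, assuming your training data is a list of `(text, label)` pairs; the helper names `flip_labels` and `subsample` are just placeholders, not part of spaCy:

```python
import random

def flip_labels(examples, labels, noise_rate, seed=0):
    """Return a copy of `examples` with `noise_rate` of labels
    reassigned uniformly at random to a different label."""
    rng = random.Random(seed)
    noisy = []
    for text, label in examples:
        if rng.random() < noise_rate:
            label = rng.choice([l for l in labels if l != label])
        noisy.append((text, label))
    return noisy

def subsample(examples, fraction, seed=0):
    """Return a random `fraction` of `examples`."""
    rng = random.Random(seed)
    return rng.sample(examples, int(len(examples) * fraction))

# Hypothetical experiment grid: train a fresh textcat model per
# (fraction, noise_rate) cell, record dev accuracy, and plot the
# resulting trade-off surface.
# for fraction in (0.25, 0.5, 1.0):
#     for noise_rate in (0.0, 0.05, 0.1, 0.2):
#         train = flip_labels(subsample(data, fraction), LABELS, noise_rate)
#         ...train on `train`, evaluate on a clean dev set...
```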

One thing to keep in mind when you run the experiment: how you simulate the errors matters. Whatever noise occurs naturally is unlikely to be uniformly distributed; some types of mislabelling will be more common than others. That can be worse than uniform noise, because uniform labelling noise probably doesn't change the average gradients over the dataset, while non-uniform noise can introduce systematic biases that push the model towards the wrong solutions.
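To illustrate the difference, biased noise could be simulated with a per-label confusion table instead of a single flip rate. The `CONFUSIONS` probabilities and label names below are made up purely for illustration:

```python
import random

# Hypothetical confusion structure: probability that a true label gets
# recorded as a specific other label. It needn't be symmetric, which is
# exactly what makes this noise biased rather than uniform.
CONFUSIONS = {
    "POSITIVE": {"NEUTRAL": 0.15},                   # often softened to NEUTRAL
    "NEUTRAL":  {"POSITIVE": 0.03, "NEGATIVE": 0.03},
    "NEGATIVE": {"NEUTRAL": 0.10},
}

def flip_biased(examples, confusions, seed=0):
    """Apply label-specific confusion probabilities rather than
    a single uniform flip rate."""
    rng = random.Random(seed)
    noisy = []
    for text, label in examples:
        r = rng.random()
        cumulative = 0.0
        new_label = label  # keep the original label by default
        for target, prob in confusions.get(label, {}).items():
            cumulative += prob
            if r < cumulative:
                new_label = target
                break
        noisy.append((text, new_label))
    return noisy
```

Comparing results from this against the uniform flipping above, at a matched overall error rate, would show how much the *shape* of the noise (not just its amount) affects your model.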
