Robustness to error in labeled data

Hello team,

I have a general question about the spaCy textcat model. Is there any study showing how robust the model is to errors in the training data (i.e. mislabeled examples)?
In other words, is it better to annotate more data with some label errors, or less data with higher label accuracy?


There's no study for that unfortunately, no. It will depend on the specifics of the problem.

You could simulate this: you can introduce noise by switching some labels, and you can obviously simulate having less data (by removing some). This should allow you to plot out the trade-offs for the problems you care about. If you do this, I do hope you share the results!
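A minimal sketch of that simulation, using plain Python (the function and label names here are hypothetical, and you would plug the resulting lists into your usual textcat training loop):

```python
import random

def flip_labels(examples, noise_rate, labels, seed=0):
    """Return a copy of (text, label) pairs where a `noise_rate` fraction
    of labels is switched uniformly at random to a different label."""
    rng = random.Random(seed)
    noisy = []
    for text, label in examples:
        if rng.random() < noise_rate:
            label = rng.choice([l for l in labels if l != label])
        noisy.append((text, label))
    return noisy

def subsample(examples, fraction, seed=0):
    """Return a random subset containing `fraction` of the examples,
    to simulate annotating less data."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

# Grid over noise rates and dataset sizes; train and evaluate a model
# at each point to map out the trade-off curve.
# for noise in (0.0, 0.1, 0.2):
#     for frac in (0.25, 0.5, 1.0):
#         train_set = subsample(flip_labels(data, noise, LABELS), frac)
#         ...train and score a textcat model on train_set...
```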

One thing to keep in mind when you run the experiment: how you simulate the errors matters. Whatever noise occurs naturally is unlikely to be uniformly distributed; some types of mislabelling will be more common than others. This can be worse than uniform noise, because uniform labelling noise probably doesn't change the average gradients on the dataset, while non-uniform noise can introduce systematic biases that lead to the wrong solutions.
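One way to sketch that non-uniform case is with a class-conditional confusion distribution, where each true label has its own probabilities of being mislabelled. The label names and probabilities below are made-up assumptions for illustration:

```python
import random

# Hypothetical annotator confusion: NEUTRAL is mistaken for POSITIVE
# far more often than POSITIVE is mistaken for NEGATIVE.
CONFUSIONS = {
    "POSITIVE": {"POSITIVE": 0.95, "NEUTRAL": 0.04, "NEGATIVE": 0.01},
    "NEUTRAL":  {"NEUTRAL": 0.85, "POSITIVE": 0.10, "NEGATIVE": 0.05},
    "NEGATIVE": {"NEGATIVE": 0.95, "NEUTRAL": 0.04, "POSITIVE": 0.01},
}

def apply_confusion(examples, confusions, seed=0):
    """Relabel (text, label) pairs by sampling from each true label's
    confusion distribution, so some mistakes are more common than others."""
    rng = random.Random(seed)
    out = []
    for text, label in examples:
        choices, weights = zip(*confusions[label].items())
        out.append((text, rng.choices(choices, weights=weights)[0]))
    return out
```

Comparing results under this kind of noise against uniform flipping at the same overall error rate would show whether the systematic bias is what hurts your model.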
