noisy data

codingnoobneedshelp · April 8, 2020, 7:08am

Hello, I want to create a simple binary classifier to classify sentences based on whether they belong to the right topic (e.g. computers). The sentences will be used later to create an NER model. My problem is that my data is very noisy (wrong sentence boundary detection, spelling mistakes and sometimes the wrong language). What would be the best approach here? Is it worth trying to teach the classifier to rule out the noisy data? Or should I create two different classifiers (one to rule out the noisy data and one for on-/off-topic)? Or just skip the noisy data while annotating?

Thanks for any help or advice.

ines · April 8, 2020, 9:18am

Hi! It's difficult to give a definitive answer, because it always depends on your data - but yes, adding a first classification step that only classifies NOISE sounds like a good approach and it's definitely something I'd try.

Many types of noise (wrong language, gibberish, leftover markup) are pretty distinct and fairly easy to detect. So if you filter those out, your "real" classifier and the downstream NER model will only have to deal with the actual text you care about and won't also have to learn how to deal with noise. And during annotation, you can use your noise classifier to pre-filter the data so you don't have to keep rejecting noisy examples.
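
To make the pre-filtering idea a bit more concrete, here's a minimal sketch in plain Python. It assumes each example is a dict with a "text" key, and the names `noise_score`, `filter_noise` and the 0.3 threshold are all made up for illustration. The scoring function is only a crude character heuristic standing in for a trained noise classifier; with a spaCy text classifier you'd probably read the score from something like `doc.cats["NOISE"]` instead.

```python
from typing import Callable, Dict, Iterable, Iterator


def noise_score(text: str) -> float:
    """Hypothetical stand-in for a trained noise classifier.

    Returns the share of non-whitespace characters that are neither
    alphanumeric nor common punctuation. Higher means noisier.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty text as noise
    allowed = ".,;:!?'\"-()"
    odd = sum(1 for c in chars if not (c.isalnum() or c in allowed))
    return odd / len(chars)


def filter_noise(
    stream: Iterable[Dict[str, str]],
    score: Callable[[str], float],
    threshold: float = 0.3,  # assumed cut-off, tune it on your own data
) -> Iterator[Dict[str, str]]:
    """Yield only the examples whose noise score is below the threshold."""
    for example in stream:
        if score(example["text"]) < threshold:
            yield example


examples = [
    {"text": "The new laptop ships with a faster CPU."},
    {"text": "</div> ### ||| >>> &&& ***"},
    {"text": ""},
]

# Only the examples that pass the filter would be sent out for annotation.
clean = list(filter_noise(examples, noise_score))
for example in clean:
    print(example["text"])
```

Because the filter is a generator, it can wrap any stream of examples lazily, so swapping the heuristic for a real model should only mean passing a different `score` function.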

codingnoobneedshelp · April 8, 2020, 10:12am

Thanks for the fast answer. I would like to try this. Could you maybe give me a hint on how I can use that noise classifier to pre-filter the noisy examples while using ner.manual? Can I just use the trained model, or are more steps needed?
