noisy data

Hello, I want to create a simple binary classifier to classify sentences based on whether they belong to the right topic (e.g. computer). The sentences will be used later to create a NER model. My problem is that my data is very noisy (wrong sentence boundary detection, spelling mistakes and sometimes the wrong language). What would be the best approach here? Is it worth trying to teach the classifier to rule out the noisy data? Or should I create two different classifiers here (one to rule out the noisy data and one for on-/off-topic)? Or just skip the noisy data while annotating?

Thanks in advance for any help.

Hi! It's difficult to give a definitive answer, because it always depends on your data - but yes, adding a first classification step that only classifies NOISE sounds like a good approach and it's definitely something I'd try :slightly_smiling_face:

Many types of noise (wrong language, gibberish, leftover markup) are pretty distinct and fairly easy to detect. So if you filter those out, your "real" classifier and the downstream NER model only have to deal with the actual text you care about and don't also have to learn how to deal with noise. And during annotation, you can use your noise classifier to pre-filter the data so you don't have to keep rejecting noisy examples.
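To make that a bit more concrete, here's a minimal sketch of what the pre-filtering step could look like, assuming you've trained a spaCy text classifier with a NOISE label (the model path, file names and threshold below are just placeholders, not part of any built-in workflow):

```python
import spacy
import srsly

# Hypothetical path to your trained noise classifier
nlp = spacy.load("./noise_model")
# Score above which an example is considered noise - tune this on held-out data
NOISE_THRESHOLD = 0.5

def filter_noise(examples):
    """Yield only the examples the noise classifier considers clean."""
    for eg in examples:
        doc = nlp(eg["text"])
        if doc.cats.get("NOISE", 0.0) < NOISE_THRESHOLD:
            yield eg

# Read raw sentences, write out only the ones that pass the filter
examples = srsly.read_jsonl("raw_sentences.jsonl")
srsly.write_jsonl("clean_sentences.jsonl", filter_noise(examples))
```

Keeping the threshold as a setting lets you decide how aggressively to drop borderline examples before you annotate or train the topic classifier.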

Thanks for the fast answer. I would like to try this. Could you maybe give me a hint how I can use that noise classifier to pre-filter the noisy examples while using ner.manual? Will I just have to use the trained model, or are there some more steps needed?