Hello, I want to create a simple binary classifier that classifies sentences by whether they belong to the right topic (e.g. computers). The sentences will later be used to create an NER model. My problem is that my data is very noisy (incorrect sentence boundary detection, spelling mistakes, and sometimes the wrong language). What would be the best approach here? Is it worth trying to teach the classifier to rule out the noisy data? Or should I create two different classifiers (one to rule out the noisy data and one for on-/off-topic)? Or should I just skip the noisy data while annotating?
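To make the two-classifier option concrete, it could look roughly like the sketch below. Everything in it is a placeholder assumption: `looks_clean`, `is_on_topic`, `TOPIC_WORDS`, and `cascade` are hypothetical names, and the toy heuristics only stand in for trained models.

```python
from typing import Callable, Iterable, List

# A "classifier" here is just any function from a sentence to True/False.
Classifier = Callable[[str], bool]

# Assumed toy vocabulary; a real topic classifier would be trained on annotations.
TOPIC_WORDS = {"computer", "cpu", "keyboard", "software"}


def looks_clean(sentence: str) -> bool:
    """Placeholder noise filter (stage 1).

    Toy heuristic: at least three tokens and final sentence punctuation.
    A real filter would be a model trained on clean vs. noisy examples.
    """
    words = sentence.split()
    return len(words) >= 3 and sentence.rstrip().endswith((".", "!", "?"))


def is_on_topic(sentence: str) -> bool:
    """Placeholder on-/off-topic classifier (stage 2)."""
    tokens = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(tokens & TOPIC_WORDS)


def cascade(
    sentences: Iterable[str],
    noise_filter: Classifier = looks_clean,
    topic_clf: Classifier = is_on_topic,
) -> List[str]:
    """Keep only sentences that pass the noise filter and are on-topic."""
    return [s for s in sentences if noise_filter(s) and topic_clf(s)]
```

With this layout the topic classifier only ever sees sentences that passed the first stage, so each stage can be annotated and evaluated separately.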
Thanks for any help or advice.