Dealing with redundant text/dirty data in training

Hi guys,

I'm trying to build a dataset for annotation and training with Prodigy. The end goal for my use case is to get good accuracy on NER.

Right now I'm creating the training dataset by scraping news articles from 50-100 different sites. The problem is that the data is dirty, e.g. it has a lot of redundant text such as info about the web page and all kinds of other text on the scraped page besides the actual article. Since I'm scraping a lot of different sites, it's not really feasible to tweak the scraper to extract just the article text (based on my understanding, please correct me if I'm wrong).

Would it work to annotate directly on this dirty data with Prodigy and just exclude the junk while annotating? Or will it be necessary to clean the data before starting the annotation process? If so, do you have any recommendations on how to do that?

Best regards,
Simon

Hi! One thing we often recommend, and that can work well, is to chain two models together and start off with a classifier that detects "junk" vs. "not junk". If you get solid accuracy here, you can use this classifier to filter out all the examples you're not interested in, and have your NER model only operate on the non-junk examples. This can make your final task (named entity recognition) both easier to label and easier to learn, because you don't have to deal with the junk examples. And at runtime, you can chain your models together the same way to process new, unseen examples.

I actually describe and show an example of this in my FAQ video around 6:55, where a standalone graphic illustrates the idea using the example of live tweets.

In Prodigy, you could use textcat.teach with a category like JUNK to bootstrap your junk classifier. Using match patterns to pre-select examples could also work well here, especially for things like broken markup or other recurring junk that follows a clear pattern.
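
For example, a bootstrapping session could look something like this (just a sketch: junk_data, news.jsonl and junk_patterns.jsonl are hypothetical names, and en_core_web_sm is only one possible base model):

# hypothetical dataset, source and patterns file names
prodigy textcat.teach junk_data en_core_web_sm news.jsonl --label JUNK --patterns junk_patterns.jsonl

The patterns file is JSONL with one match pattern per line, e.g. for a typical piece of page boilerplate:

{"label": "JUNK", "pattern": [{"lower": "all"}, {"lower": "rights"}, {"lower": "reserved"}]}

Once you have that working, you can process your incoming texts with the trained model, select the ones with a low junk score and queue them up for NER annotation: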

import spacy

nlp = spacy.load('./junk-model')  # the trained junk classifier
for doc in nlp.pipe(THE_INCOMING_TEXTS):
    if doc.cats['JUNK'] <= 0.33:  # or whichever threshold works for you
        ...  # do something with doc.text here, e.g. queue it up for NER annotation
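
At runtime you can then chain the junk classifier and the NER model together, as mentioned above. Here's a minimal sketch, assuming your NER pipeline is saved to ./ner-model (a hypothetical path):

import spacy

junk_nlp = spacy.load('./junk-model')
ner_nlp = spacy.load('./ner-model')  # hypothetical path to your trained NER model

# keep only the texts the classifier considers non-junk ...
texts = [doc.text for doc in junk_nlp.pipe(THE_INCOMING_TEXTS)
         if doc.cats['JUNK'] <= 0.33]
# ... and run the NER model over those
for doc in ner_nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])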

In general, the data you're labelling and using as training data should match the data the model will see at runtime. So if you do end up pre-processing the data, you'll also have to do this at runtime. If you're only cleaning for training and evaluation, you may end up with great accuracy numbers – but your runtime model would still be completely useless.

Preprocessing at runtime is totally feasible if your model's job is to do large-scale text processing or information extraction and you have full control over the input. (If you need to respond very quickly to user input you can't control, that's not really an option.)

Btw, in case you haven't seen it yet, you might want to check out textacy, which is a really cool library for pre- and postprocessing text for spaCy. It comes with a lot of nice utilities for standardising raw text (whitespace, unicode etc.) that might be useful.
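
For example, a quick normalisation pass could look something like this (a sketch assuming a recent textacy version; check the textacy docs for the exact API in your install):

from textacy import preprocessing

# compose a few of textacy's normalisation helpers into one function
clean = preprocessing.make_pipeline(
    preprocessing.normalize.unicode,     # unicode normalisation (NFC by default)
    preprocessing.normalize.whitespace,  # collapse repeated whitespace
)
text = clean(raw_scraped_text)  # raw_scraped_text is your scraped article text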
