Tip: Preprocessing text (whitespace, unicode) with textacy

Inspired by this discussion, I wrote a little Prodigy recipe that preprocesses a stream of text using the textacy package for higher-level NLP with spaCy.

The recipe takes an input source and can remove double or trailing whitespace, fix broken unicode and mojibake, convert non-ASCII characters to the closest ASCII characters and replace accented characters with unaccented ones. textacy offers various other parameters as well, which you can plug in the same way.
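To get a feel for what two of these steps do to a string, here is a rough standard-library sketch. It is not textacy's implementation, and the function names `normalize_ws` and `strip_accents` are made up for illustration. Fixing mojibake and transliterating to ASCII need more than the standard library, so they are left out.

```python
import re
import unicodedata


def normalize_ws(text):
    """Collapse repeated whitespace and blank lines, then strip both ends."""
    # runs of spaces, tabs etc. (anything but a line break) become one space
    text = re.sub(r"[^\S\n]+", " ", text)
    # trim whitespace around line breaks and collapse repeated line breaks
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_ws("next video : bill guesses how much    luxurious cars costs"))
# next video : bill guesses how much luxurious cars costs
print(strip_accents("café naïve"))
# cafe naive
```

Note that NFKD also applies compatibility mappings (for example, it splits ligatures), so this changes slightly more than just accents.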

:warning:️ Disclaimer: If you want the model to learn how to deal with unclean text, it also needs to see examples of this during training. So in cases like this, it’s usually not recommended to clean up your training data (actually, quite the opposite – but that’s something for another data augmentation recipe).

To use the recipe, you need to install textacy:

pip install textacy

Then place the following in a recipe file, e.g. recipe.py:

import prodigy
import ujson
from textacy.preprocess import normalize_whitespace, preprocess_text


@prodigy.recipe('preprocess',
    source=prodigy.recipe_args['source'],
    normalize_ws=('Normalize whitespace', 'flag', 'ws', bool),
    fix_unicode=('Fix broken unicode', 'flag', 'u', bool),
    transliterate=('Convert non-ASCII if possible', 'flag', 't', bool),
    no_accents=('Replace accented characters with unaccented', 'flag', 'na', bool))
def preprocess(source, normalize_ws=False, fix_unicode=False,
               transliterate=False, no_accents=False):
    """Clean up the text of each incoming example and print it as JSONL."""
    stream = prodigy.get_stream(source)
    for eg in stream:
        text = eg['text']
        if normalize_ws:
            text = normalize_whitespace(text)
        # the remaining options are handled by textacy's all-in-one helper
        text = preprocess_text(text, fix_unicode=fix_unicode,
                               transliterate=transliterate, no_accents=no_accents)
        eg['text'] = text
        # write example to stdout, one JSON object per line, so it can be piped
        print(ujson.dumps(eg, escape_forward_slashes=False, ensure_ascii=False))

For an overview of the available command-line options, you can run:

prodigy preprocess --help -F recipe.py

You can preview the preprocessed stream like this:

prodigy preprocess your_data.jsonl -ws -u -na -F recipe.py | less

And then pipe it forward to another recipe – for example:

prodigy preprocess your_data.jsonl -ws -u -na -F recipe.py | prodigy ner.teach your_dataset en_core_web_sm
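Piping works here because the recipe prints one JSON object per line, so whatever sits on the other side of the pipe only has to read lines and parse them. As a rough sketch of what such a consumer sees (this is not Prodigy's own loader, and `read_jsonl` is a hypothetical helper):

```python
import io
import json


def read_jsonl(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for the piped output; in a real script you would pass sys.stdin.
piped = io.StringIO('{"text": "first example"}\n{"text": "second example"}\n')
for eg in read_jsonl(piped):
    print(eg["text"])
```

Any extra keys on an example (for instance metadata) simply come along, since each line is parsed as a whole object.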

Hello! I'm currently working with social media text, and I was wondering whether the following preprocessing is good practice. I want to improve the model's accuracy at identifying the various classes in my textcat annotations, so I've avoided any preprocessing that would change the original form of the text. Here is a sample of what most of the documents look like before and after some minor processing. Should I get rid of the newlines and extra spacing?

  • Before preprocessing:
raw_docs = [
'next video : bill guesses how much    luxurious cars costs',
'Good night now,I am really tired so I am going to bed\nGood luck with your next video 💤💤💤💤😴😴😴😴5mins'
]
  • After preprocessing (spaCy's sentence segmentation and encoding to ASCII):
sents = [
'next video : bill guesses how much    luxurious cars costs',
'Good night now,I am really tired',
'so I am going to bed\n',
'Good luck with your next video 5mins'
]

  • Update

I watched the FAQ video that covers the topic "What if I need to label long texts", and I found the information I needed! This tool has made experimenting a lot easier. Thank you to everyone at spaCy/Prodigy :slight_smile:

Glad to hear you solved the problem!

One thing you might also try is a little extra normalization. Specifically, normalizing the punctuation might help a bit, depending on the specifics of your data. It's less important for text classification, but if you're working with the other models, it could make a difference.
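What "normalizing the punctuation" means depends on the data, so treat the following as one possible sketch rather than a recommendation: it maps a few typographic characters to plain ASCII equivalents and adds a missing space after a comma, semicolon, exclamation mark or question mark that is directly followed by a letter. The mapping table and the `normalize_punct` name are assumptions made for this example.

```python
import re

# Typographic characters mapped to plain equivalents (an illustrative selection).
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
})


def normalize_punct(text):
    """Replace typographic punctuation and fix missing spaces after , ; ! ?"""
    text = text.translate(PUNCT_MAP)
    # "now,I" -> "now, I"; digits are left alone so "1,000" stays intact
    return re.sub(r"([,;!?])(?=[A-Za-z])", r"\1 ", text)


print(normalize_punct("Good night now,I am really tired"))
# Good night now, I am really tired
```

Periods are deliberately left out of the spacing rule, since adding a space after them would break URLs, file names and decimals.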