Inspired by this discussion, I wrote a little Prodigy recipe that preprocesses a stream of text using the textacy package for higher-level NLP with spaCy.
The recipe takes an input source and can remove double or trailing whitespace, fix broken unicode and mojibake, convert non-ASCII characters to the closest ASCII characters and replace accented characters with unaccented ones. Various other parameters are available as well, and you can add them in the same way.
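To give a rough idea of what two of these cleanup steps do, here's a minimal standard-library sketch. It's only an approximation for illustration and not how textacy implements them; `collapse_whitespace` and `strip_accents` are hypothetical helper names I made up for this example.

```python
import re
import unicodedata


def collapse_whitespace(text):
    """Collapse runs of spaces/tabs into one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


example = "  Café   au  lait  "
cleaned = strip_accents(collapse_whitespace(example))
print(cleaned)  # Cafe au lait
```

The real textacy functions handle more cases (line breaks, mojibake, transliteration), so treat this purely as a mental model.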
Disclaimer: If you want the model to learn how to deal with unclean text, it also needs to see examples of this during training. So in cases like this, it’s usually not recommended to clean up your training data (actually, quite the opposite – but that’s something for another data augmentation recipe).
To use the recipe, you need to install textacy:
pip install textacy
Then place the following in a recipe file, e.g. recipe.py:
import prodigy
from textacy.preprocess import normalize_whitespace, preprocess_text
import ujson


@prodigy.recipe('preprocess',
    source=prodigy.recipe_args['source'],
    normalize_ws=('Normalize whitespace', 'flag', 'ws', bool),
    fix_unicode=('Fix broken unicode', 'flag', 'u', bool),
    transliterate=('Convert non-ASCII if possible', 'flag', 't', bool),
    no_accents=('Replace accented characters with unaccented', 'flag', 'na', bool))
def preprocess(source, normalize_ws=False, fix_unicode=False,
               transliterate=False, no_accents=False):
    stream = prodigy.get_stream(source)
    for eg in stream:
        text = eg['text']
        if normalize_ws:
            text = normalize_whitespace(text)
        text = preprocess_text(text, fix_unicode=fix_unicode,
                               transliterate=transliterate, no_accents=no_accents)
        eg['text'] = text
        # write example to stdout
        print(ujson.dumps(eg, escape_forward_slashes=False, ensure_ascii=False))
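In case it's not obvious why the recipe passes ensure_ascii=False when writing each example: without it, non-ASCII characters are written as escape sequences, which makes the output harder to read when you preview it. Here's a small sketch using the standard library's json module as a stand-in for ujson (the example text is made up):

```python
import json

eg = {"text": "Zoë visited https://example.com/café"}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(eg)
# With ensure_ascii=False the characters are written as-is
readable = json.dumps(eg, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so this only affects how readable the output is. The escape_forward_slashes option is specific to ujson; the standard json module doesn't escape forward slashes in the first place.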
For an overview of the available command-line options, you can run:
prodigy preprocess --help -F recipe.py
You can preview the preprocessed stream like this:
prodigy preprocess your_data.jsonl -ws -u -na -F recipe.py | less
And then pipe it forward to another recipe – for example:
prodigy preprocess your_data.jsonl -ws -u -na -F recipe.py | prodigy ner.teach your_dataset en_core_web_sm
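Since the recipe simply prints one JSON object per line, you can also pipe its output into your own scripts. Here's a minimal sketch of such a consumer, assuming newline-delimited JSON on standard input. `read_jsonl` is a hypothetical helper written for this example, not a Prodigy function, and the sample below uses an in-memory stream in place of a real pipe:

```python
import io
import json
import sys


def read_jsonl(stream=None):
    """Lazily yield one dict per non-empty line of a file-like object."""
    stream = sys.stdin if stream is None else stream
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Stand-in for piped input, so the sketch runs on its own
fake_stdin = io.StringIO('{"text": "first"}\n\n{"text": "second"}\n')
texts = [eg["text"] for eg in read_jsonl(fake_stdin)]
print(texts)  # ['first', 'second']
```

Called without arguments, the generator reads from sys.stdin, so a script built around it can sit on the right-hand side of the pipe.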