Remove tokens/words before NER training or prediction


I am looking for the best way to eliminate emojis or broken characters within the spaCy framework. I read this:

It seems close to what I want to apply, but my question is how my NER pipeline can use the custom extension attribute (is_excluded) to ignore emojis during both training and prediction.

An example is:

:gift_heart::mega:CALLING PASSIONATE RETAIL MANAGERS :gift_heart::mega: - NORTH SYDNEY AND BEYOND :point_right::point_right:

Without the emojis, we obtained better NER results.

What would be the best practice for removing undesired words or tokens?
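For illustration, a minimal emoji check could look like the sketch below. The codepoint ranges and the helper names here are assumptions for this example, not a complete emoji table; a real setup would use a proper emoji lookup.

```python
def is_emoji_char(ch):
    # Rough heuristic: a few common emoji codepoint blocks.
    # Deliberately incomplete; a real check would use a full emoji table.
    cp = ord(ch)
    return (
        0x1F300 <= cp <= 0x1FAFF   # symbols & pictographs, supplemental
        or 0x2600 <= cp <= 0x27BF  # misc symbols and dingbats
        or cp in (0x2764, 0xFE0F)  # heavy black heart, variation selector
    )

def is_emoji_token(text):
    # A token counts as an emoji if every character in it is an emoji char.
    return len(text) > 0 and all(is_emoji_char(ch) for ch in text)

# This predicate could back a custom token attribute, e.g.:
# Token.set_extension("is_emoji", getter=lambda t: is_emoji_token(t.text))
```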

Best and thanks in advance,


The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want. Most spaCy pipeline components change the Doc in-place, but the pipeline doesn’t assume this. The component can just return a new Doc object, like this:

from spacy.tokens import Doc

def filter_emojis(doc):
    non_emoji = [word for word in doc if not word._.is_emoji]
    spaces = []
    for word in non_emoji:
        if word.whitespace_:
            spaces.append(True)
        elif (word.i + 1) < len(doc) and word.nbor(1)._.is_emoji:
            # If an emoji separates two words, keep a space between them.
            spaces.append(True)
        else:
            spaces.append(False)
    return Doc(doc.vocab, words=[word.text for word in non_emoji], spaces=spaces)

You should be able to insert that in your pipeline after the emoji predictor, but before the statistical components like the NER, POS tagger, etc.
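Assembled end to end, this might look like the sketch below. The blank pipeline and the crude codepoint check standing in for a real emoji detector are assumptions for illustration; in a real setup the `is_emoji` attribute would come from your emoji component.

```python
import spacy
from spacy.tokens import Doc, Token

def _is_emoji(token):
    # Crude codepoint heuristic, assumed for this sketch only.
    return len(token.text) > 0 and all(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in token.text
    )

Token.set_extension("is_emoji", getter=_is_emoji, force=True)

def filter_emojis(doc):
    # Build a new Doc containing only the non-emoji tokens,
    # preserving sensible spacing between the survivors.
    non_emoji = [word for word in doc if not word._.is_emoji]
    spaces = []
    for word in non_emoji:
        if word.whitespace_:
            spaces.append(True)
        elif (word.i + 1) < len(doc) and word.nbor(1)._.is_emoji:
            spaces.append(True)  # an emoji separated two words
        else:
            spaces.append(False)
    return Doc(doc.vocab, words=[word.text for word in non_emoji], spaces=spaces)

nlp = spacy.blank("en")
doc = nlp("\U0001F381 CALLING RETAIL MANAGERS")
clean = filter_emojis(doc)
```

`clean` then has immutable text of its own, with the emoji tokens gone, and can be passed on to the statistical components.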

That’s at runtime, though. At training time the situation’s a bit different. During the nlp.update() loop, we actually assume the components are independent. This means you should make sure the Doc and the GoldParse object are correctly set up before you pass them into nlp.update() for training.
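For example, if the gold annotations are character-offset entity spans, those offsets have to be shifted when emoji characters are stripped from the raw text before the Doc and GoldParse are built. A minimal sketch, assuming the same crude codepoint heuristic as above, a `(start, end, label)` tuple format, and that no entity span itself contains an emoji:

```python
def is_emoji_char(ch):
    # Crude codepoint heuristic, assumed for this sketch only.
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF

def strip_emojis_with_offsets(text, entities):
    """Remove emoji characters and shift (start, end, label) spans to match.

    Assumes no entity span itself contains an emoji character.
    """
    def shift(pos):
        # New position = old position minus emoji chars removed before it.
        return pos - sum(1 for ch in text[:pos] if is_emoji_char(ch))

    new_text = "".join(ch for ch in text if not is_emoji_char(ch))
    new_ents = [(shift(start), shift(end), label)
                for start, end, label in entities]
    return new_text, new_ents

text = "\U0001F381\U0001F381Hiring in SYDNEY \U0001F381"
ents = [(12, 18, "GPE")]  # "SYDNEY" in the original text
new_text, new_ents = strip_emojis_with_offsets(text, ents)
```

Note this only removes the emoji characters themselves; any leftover double spaces would still need collapsing (with a matching offset shift) if your tokenizer is sensitive to them.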

Thanks Honnibal for the answer, and keep up the great work.