Remove tokens/words before NER training or prediction


I am looking for the best way to eliminate emojis or broken characters within the spaCy framework. I read this:

It seems close to what I want to apply, but my question is how my NER pipeline can use the custom extension attribute (is_excluded) to ignore emojis during both training and prediction.

An example is:

:gift_heart::mega:CALLING PASSIONATE RETAIL MANAGERS :gift_heart::mega: - NORTH SYDNEY AND BEYOND :point_right::point_right:

Without the emojis, we obtained better NER results.

What would be the best practice for removing undesired words or tokens?
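For illustration, a minimal emoji check could look like the sketch below. The codepoint ranges and the helper names here are assumptions for this example, not a complete emoji table; a real setup would use a proper emoji lookup.

```python
def is_emoji_char(ch):
    # Rough heuristic: a few common emoji codepoint blocks.
    # Deliberately incomplete; a real check would use a full emoji table.
    cp = ord(ch)
    return (
        0x1F300 <= cp <= 0x1FAFF   # symbols & pictographs, supplemental
        or 0x2600 <= cp <= 0x27BF  # misc symbols and dingbats
        or cp in (0x2764, 0xFE0F)  # heavy black heart, variation selector
    )

def is_emoji_token(text):
    # A token counts as an emoji if every character in it is an emoji char.
    return len(text) > 0 and all(is_emoji_char(ch) for ch in text)

# This predicate could back a custom token attribute, e.g.:
# Token.set_extension("is_emoji", getter=lambda t: is_emoji_token(t.text))
```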

Best and thanks in advance,


The Doc object has immutable text, but it should be pretty easy and quite efficient to create a new Doc object with the subset of tokens you want. Most spaCy pipeline components change the Doc in-place, but the pipeline doesn’t assume this. The component can just return a new Doc object, like this:

from spacy.tokens import Doc

def filter_emojis(doc):
    non_emoji = [word for word in doc if not word._.is_emoji]
    spaces = []
    for word in non_emoji:
        if word.whitespace_:
            spaces.append(True)
        elif (word.i + 1) < len(doc) and word.nbor(1)._.is_emoji:
            # If an emoji separates two words, keep a space between them.
            spaces.append(True)
        else:
            spaces.append(False)
    return Doc(doc.vocab, words=[word.text for word in non_emoji], spaces=spaces)

You should be able to insert that in your pipeline after the emoji predictor, but before the statistical components like the NER, POS tagger, etc.
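Assembled end to end, this might look like the sketch below. The blank pipeline and the crude codepoint check standing in for a real emoji detector are assumptions for illustration; in a real setup the `is_emoji` attribute would come from your emoji component.

```python
import spacy
from spacy.tokens import Doc, Token

def _is_emoji(token):
    # Crude codepoint heuristic, assumed for this sketch only.
    return len(token.text) > 0 and all(
        0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
        for ch in token.text
    )

Token.set_extension("is_emoji", getter=_is_emoji, force=True)

def filter_emojis(doc):
    # Build a new Doc containing only the non-emoji tokens,
    # preserving sensible spacing between the survivors.
    non_emoji = [word for word in doc if not word._.is_emoji]
    spaces = []
    for word in non_emoji:
        if word.whitespace_:
            spaces.append(True)
        elif (word.i + 1) < len(doc) and word.nbor(1)._.is_emoji:
            spaces.append(True)  # an emoji separated two words
        else:
            spaces.append(False)
    return Doc(doc.vocab, words=[word.text for word in non_emoji], spaces=spaces)

nlp = spacy.blank("en")
doc = nlp("\U0001F381 CALLING RETAIL MANAGERS")
clean = filter_emojis(doc)
```

`clean` then has immutable text of its own, with the emoji tokens gone, and can be passed on to the statistical components.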

That’s at runtime, though. At training time the situation’s a bit different. During the nlp.update() loop, we actually assume the components are independent. This means you should make sure the Doc and the GoldParse object are correctly set up before you pass them into nlp.update() for training.
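For example, if the gold annotations are character-offset entity spans, those offsets have to be shifted when emoji characters are stripped from the raw text before the Doc and GoldParse are built. A minimal sketch, assuming the same crude codepoint heuristic as above, a `(start, end, label)` tuple format, and that no entity span itself contains an emoji:

```python
def is_emoji_char(ch):
    # Crude codepoint heuristic, assumed for this sketch only.
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF

def strip_emojis_with_offsets(text, entities):
    """Remove emoji characters and shift (start, end, label) spans to match.

    Assumes no entity span itself contains an emoji character.
    """
    def shift(pos):
        # New position = old position minus emoji chars removed before it.
        return pos - sum(1 for ch in text[:pos] if is_emoji_char(ch))

    new_text = "".join(ch for ch in text if not is_emoji_char(ch))
    new_ents = [(shift(start), shift(end), label)
                for start, end, label in entities]
    return new_text, new_ents

text = "\U0001F381\U0001F381Hiring in SYDNEY \U0001F381"
ents = [(12, 18, "GPE")]  # "SYDNEY" in the original text
new_text, new_ents = strip_emojis_with_offsets(text, ents)
```

Note this only removes the emoji characters themselves; any leftover double spaces would still need collapsing (with a matching offset shift) if your tokenizer is sensitive to them.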

Thanks Honnibal for the answer, and keep up the great work.