I am looking for the best way to remove emojis or broken characters within the spaCy framework. I read this: https://stackoverflow.com/questions/54617296/can-a-token-be-removed-from-a-spacy-document-during-pipeline-processing
It seems similar to, or almost exactly, what I want to do. But my question is: how can my NER pipeline use the custom extension attribute (is_excluded) so that emojis are ignored during both training and prediction?
An example is:
CALLING PASSIONATE RETAIL MANAGERS - NORTH SYDNEY AND BEYOND
Without the emojis, we obtained better NER results.
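To make the goal concrete, here is a minimal sketch of the kind of cleaning I mean, applied to the raw text before it reaches the pipeline. It uses only the Python standard library. The function name strip_emojis and the choice of Unicode categories to drop are my own assumptions for illustration; they are not part of spaCy and not taken from the linked answer.

```python
import unicodedata

# Unicode general categories dropped in this sketch (an assumption, tune as needed):
#   So = "Symbol, other": most emojis and U+FFFD, but also symbols such as the copyright sign
#   Cs = surrogate, Co = private use, Cn = unassigned (typical of broken characters)
DROP_CATEGORIES = {"So", "Cs", "Co", "Cn"}

# Invisible characters that glue emoji sequences together.
EMOJI_JOINERS = {"\u200d", "\ufe0f"}  # zero-width joiner, variation selector-16


def strip_emojis(text: str) -> str:
    """Remove emoji-like and broken characters, then collapse runs of whitespace."""
    kept = [
        ch
        for ch in text
        if ch not in EMOJI_JOINERS
        and unicodedata.category(ch) not in DROP_CATEGORIES
    ]
    # split() / join() collapses the double spaces left behind by removed characters.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    raw = "CALLING \U0001F525 PASSIONATE RETAIL MANAGERS \ufffd - NORTH SYDNEY AND BEYOND"
    print(strip_emojis(raw))
```

The cleaned string would then be what gets passed to the pipeline at both training and prediction time. One caveat with this approach: if the NER annotations are character offsets, they have to be computed on the cleaned text, because removing characters shifts every offset after them.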
Please let me know what the best practice is for removing undesired words or tokens.
Best and thanks in advance,