spaCy tags punctuation as entities

Arul · November 16, 2018, 9:58am

With my custom-trained model, spaCy tends to tag punctuation such as “.”, “:”, “~” and “,” as entities. The training data does not contain any such entities. Why is this happening?

honnibal · November 17, 2018, 12:35pm

This is currently a flaw in spaCy’s modelling that affects low-data use cases, such as in Prodigy. It’ll be corrected in spaCy soon.

In the meantime, you might want to add either pre- or post-processes that prevent spaCy from doing this. A post-process will be easiest to implement. Something like this:


def remove_invalid_entities(doc):
    """Filter out invalid entities."""
    doc.ents = [ent for ent in doc.ents if is_valid_entity(ent)]
    return doc

def is_valid_entity(ent):
    """Reject single-token entities that are only punctuation or whitespace."""
    # Multi-token entities are always kept
    if len(ent) > 1:
        return True
    word = ent[0]
    return not (word.is_punct or word.is_space)

# Add our post-process to the pipeline
# (nlp is your loaded model)
nlp.add_pipe(remove_invalid_entities, after='ner')
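
If you are on spaCy v3 or later, passing a bare function to `add_pipe` no longer works: the component has to be registered under a name first. Here is a rough sketch of the same post-process in that style. The blank English pipeline and the hand-set entities are assumptions standing in for your custom model and its predictions; with a real model you would add the component with `after="ner"`.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span


def is_valid_entity(ent):
    """Reject single-token entities that are only punctuation or whitespace."""
    if len(ent) > 1:
        return True
    token = ent[0]
    return not (token.is_punct or token.is_space)


@Language.component("remove_invalid_entities")
def remove_invalid_entities(doc):
    """Filter out invalid entities."""
    doc.ents = [ent for ent in doc.ents if is_valid_entity(ent)]
    return doc


# Assumption: a blank pipeline stands in for the custom trained model.
# With a real model, use: nlp.add_pipe("remove_invalid_entities", after="ner")
nlp = spacy.blank("en")
nlp.add_pipe("remove_invalid_entities")

# Simulate the faulty predictions by setting entities by hand
doc = nlp.make_doc("Berlin , Germany")
doc.ents = [Span(doc, 0, 1, label="GPE"), Span(doc, 1, 2, label="GPE")]

# Apply the post-process; only the non-punctuation entity should survive
doc = remove_invalid_entities(doc)
print([ent.text for ent in doc.ents])
```

The filtering logic is unchanged; only the way the component is attached to the pipeline differs.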

Arul · November 17, 2018, 8:17pm

Thank you, good to know.
Yes, I had post-processing as an option, but I wanted to confirm that I am not missing anything in training. This is very helpful. Thank you.

Arul · November 19, 2018, 7:38pm

Does it happen with numbers too? I am getting numbers as entities too.
