spaCy tags punctuation as entities

With a custom-trained model, spaCy tends to tag punctuation such as “.”, “:”, “~” and “,” as entities. The training data does not contain any such entities. Why is this happening?

This is currently a flaw in spaCy’s modelling that affects low-data use cases, such as in Prodigy. It’ll be corrected in spaCy soon.

In the meantime, you might want to add a pre- or post-processing step that prevents spaCy from doing this. A post-process will be easiest to implement. Something like this:


def remove_invalid_entities(doc):
    """Filter out invalid entities."""
    doc.ents = [ent for ent in doc.ents if is_valid_entity(ent)]
    return doc

def is_valid_entity(ent):
    """Reject single-token entities that are punctuation or whitespace."""
    # Multi-token entities are always kept
    if len(ent) > 1:
        return True
    word = ent[0]
    return not (word.is_punct or word.is_space)

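# Add our post-process to the pipeline (nlp is your loaded model).
# This is the spaCy v2 call; in v3, register the function with
# @Language.component and pass its registered name to add_pipe instead.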
nlp.add_pipe(remove_invalid_entities, after='ner')

Thank you. Good to know 🙂
Yes, I had a post-process as an option, but I wanted to confirm that I am not missing anything in training. This is very helpful. Thank you.

Does it happen with numbers too? I am getting numbers as entities too.
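
If single-token numbers are also being tagged, the same post-process idea from earlier in the thread could be extended to cover them. The following is only a sketch: it assumes the check runs on spaCy tokens, which expose is_punct, is_space and like_num, and it uses a hypothetical fake_token helper as a stand-in so the logic can be tried without a trained model. Note that like_num is also true for number words such as "ten" in English, so this is only appropriate if none of your labels are meant to cover bare numbers.

```python
from types import SimpleNamespace


def is_valid_entity(ent):
    """Reject single-token entities that are punctuation, whitespace or number-like."""
    # Multi-token entities are always kept, as in the original filter
    if len(ent) > 1:
        return True
    word = ent[0]
    return not (word.is_punct or word.is_space or word.like_num)


def fake_token(text, is_punct=False, is_space=False, like_num=False):
    """Hypothetical stand-in for a spaCy Token, carrying only the flags used above."""
    return SimpleNamespace(
        text=text, is_punct=is_punct, is_space=is_space, like_num=like_num
    )


# Each "entity" here is a plain list of stand-in tokens
candidates = [
    [fake_token("Berlin")],
    [fake_token(",", is_punct=True)],
    [fake_token("42", like_num=True)],
    [fake_token("42", like_num=True), fake_token("Street")],
]

kept = [ent for ent in candidates if is_valid_entity(ent)]
for ent in kept:
    print(" ".join(tok.text for tok in ent))
```

With real spaCy docs, this is_valid_entity can be dropped in as a replacement for the one used by remove_invalid_entities, since it only reads attributes that spaCy tokens already provide.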