With my custom-trained model, spaCy tends to tag punctuation such as “.”, “:”, “~” and “,” as entities. The training data does not contain any such entities. Why is this happening?
This is currently a flaw in spaCy’s modelling that affects low-data use cases, such as in Prodigy. It’ll be corrected in spaCy soon.
In the meantime, you might want to add a pre- or post-processing step that prevents spaCy from doing this. A post-process will be easiest to implement. Something like this:
def remove_invalid_entities(doc):
    """Filter out invalid entities."""
    doc.ents = [ent for ent in doc.ents if is_valid_entity(ent)]
    return doc

def is_valid_entity(ent):
    if len(ent) > 1:
        return True
    else:
        word = ent[0]
        if word.is_punct:
            return False
        elif word.is_space:
            return False
        else:
            return True

# Add our post-process to the pipeline
nlp.add_pipe(remove_invalid_entities, after='ner')
Thank you. Good to know
Yes, I had post-processing as an option, but I wanted to confirm I am not missing anything in training. This is very helpful. Thank you.
Does it happen with numbers too? I am getting numbers as entities too.
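If numbers show up the same way, one option is to extend the same post-process check. The sketch below is only an illustration under a few assumptions: it reads the `text`, `is_space` and `like_num` token attributes that spaCy tokens provide, `FakeToken` is a hypothetical stand-in so the example runs without a trained model, and the `allow_numbers` flag is my own addition. It also tests for "no alphanumeric characters" instead of `is_punct`, because symbols such as "~" may not count as punctuation.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token, used only for this demo."""
    text: str
    is_space: bool = False
    like_num: bool = False


def is_valid_entity(ent, allow_numbers=False):
    """Reject single-token entities that are symbols, whitespace or numbers."""
    if len(ent) > 1:
        # Multi-token entities are kept, as in the original post-process.
        return True
    word = ent[0]
    if word.is_space:
        return False
    # Catches ".", ":", "," and also symbols like "~".
    if not any(ch.isalnum() for ch in word.text):
        return False
    # Optionally drop number-like tokens such as "42".
    if word.like_num and not allow_numbers:
        return False
    return True


tilde = [FakeToken("~")]
number = [FakeToken("42", like_num=True)]
name = [FakeToken("Berlin")]
multi = [FakeToken("3", like_num=True), FakeToken("km")]
print([is_valid_entity(e) for e in (tilde, number, name, multi)])
# → [False, False, True, True]
```

With real spaCy spans the same function should drop in as a replacement for `is_valid_entity` above. If some of your labels are legitimately numeric (quantities, dates), pass `allow_numbers=True` or check `ent.label_` before filtering.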