Excluding patterns for NER

Hi there,

Is there any way I can specify words or patterns that are not entities?

I am training an NER model from scratch (using the workflow that Matthew describes in his video). For some reason the model keeps annotating single words such as prepositions, determiners or even punctuation as entities (in particular, it has asked me a huge number of times whether “with” was an entity, despite me rejecting it every time). Is there a way I can tell it that those are not entities, so that I don’t have to manually reject them every time?

Cheers,

Luca

Hi! This sounds similar to the antipatterns request here:

We don't currently have that implemented out of the box, but you could add a filter to the stream at the very end of the recipe that explicitly doesn't send out an example if the span text is part of an exclude list. For example:

def filter_stream(stream):
    exclude_list = ("with", ".", ",")  # etc.
    for eg in stream:
        spans = eg.get("spans", [])
        if not spans:  # no suggested span, nothing to filter on
            yield eg
            continue
        span = spans[0]
        if eg["text"][span["start"]:span["end"]] not in exclude_list:
            yield eg

# End of the recipe
stream = filter_stream(stream)
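To illustrate the idea, here's a self-contained sketch of the same filter run over two made-up task dicts (the texts, labels and offsets below are invented for the demo, not real Prodigy data):

```python
def filter_stream(stream):
    exclude_list = ("with", ".", ",")  # words/symbols to never send out
    for eg in stream:
        spans = eg.get("spans", [])
        if not spans:  # no suggested span, nothing to filter on
            yield eg
            continue
        span = spans[0]
        if eg["text"][span["start"]:span["end"]] not in exclude_list:
            yield eg

# Hypothetical incoming examples: one sensible suggestion, one "with"
stream = [
    {"text": "Berlin is big", "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
    {"text": "I went with them", "spans": [{"start": 7, "end": 11, "label": "GPE"}]},
]

filtered = list(filter_stream(stream))
# Only the "Berlin" example survives; the "with" suggestion is dropped
```

The trade-off stays the same as above: filtered examples never reach the annotator, so the model never gets an explicit reject for them either.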

However, this also means that you won't get to annotate it. From what you describe, it sounds like your model is a bit "lost" and possibly doesn't get to see enough positive examples, so it starts suggesting a lot of very random tokens over and over again. Are you able to add more patterns to help bootstrap the suggestions? Alternatively, it's also possible that your use case just needs the model to be pre-trained more before you can start annotating with the model in the loop. So you might want to experiment with doing some manual annotation first so the model knows at least something about the entity type.
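For reference, patterns are just a JSONL file of match rules you pass in via `--patterns`. A minimal sketch (the `CITY` label and the tokens here are made-up placeholders for whatever entity type you're training):

```
{"label": "CITY", "pattern": "Berlin"}
{"label": "CITY", "pattern": [{"lower": "new"}, {"lower": "york"}]}
```

The first form matches the exact string, the second is a token-based spaCy Matcher pattern. A handful of rules like these can be enough to surface more positive candidates early on.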

The model actually shows somewhat weird behavior: it makes a couple of very accurate (or at least sensible) predictions, then gets caught up on a nonsense annotation (such as “with” or “in” or punctuation) and asks about it 10 times in a row (often the exact same word/symbol).

You are right about it not having enough initial data, I’m now trying to annotate a substantial amount manually and will see if it works better.