Excluding patterns for NER

Hi there,

Is there any way I can specify words or patterns that are not entities?

I am training a NER model from scratch (using the workflow that Matthew describes in his video). For some reason the model keeps annotating single words such as prepositions, determiners or even punctuation as entities (in particular, it has asked me a huge number of times whether “with” was an entity, despite me rejecting it every time). Is there a way I can tell it that those are not entities, so that I don’t have to manually reject them every time?



Hi! This sounds similar to the antipatterns request here:

We don't currently have that implemented out of the box, but you could add a filter to the stream at the very end of the recipe that explicitly doesn't send out an example if the span text is part of an exclude list. For example:

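def filter_stream(stream):
    # Span texts that should never be sent out as entity suggestions
    exclude_list = {"with", ".", ","}  # etc.
    for eg in stream:
        # Examples without any suggested spans are passed through unchanged
        spans = eg.get("spans", [])
        # Skip the example if any suggested span's text is on the exclude list
        if any(eg["text"][s["start"]:s["end"]] in exclude_list for s in spans):
            continue
        yield eg

# End of the recipe
stream = filter_stream(stream)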
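To try the filter outside of a recipe, here is a small self-contained sketch. It assumes the task format used above (a "text" plus character-offset "spans") and adds case-insensitive matching, which is my own addition. The example sentences and the "PRODUCT" label are made up for illustration.

```python
def filter_stream(stream, exclude=("with", "in", ".", ",")):
    """Yield only examples whose suggested spans are not on the exclude list."""
    # Lowercase everything so "With" and "with" are treated the same
    exclude_set = {term.lower() for term in exclude}
    for eg in stream:
        span_texts = [
            eg["text"][span["start"]:span["end"]].lower()
            for span in eg.get("spans", [])
        ]
        if any(text in exclude_set for text in span_texts):
            continue  # drop the example so it is never shown for annotation
        yield eg


# Made-up examples; "PRODUCT" is a hypothetical label
examples = [
    {"text": "I paid with Visa", "spans": [{"start": 7, "end": 11, "label": "PRODUCT"}]},
    {"text": "I paid with Visa", "spans": [{"start": 12, "end": 16, "label": "PRODUCT"}]},
    {"text": "Nothing suggested here", "spans": []},
]

kept = list(filter_stream(examples))
for eg in kept:
    print(eg["text"], eg["spans"])
```

Here the first example (suggesting "with") is dropped, while the "Visa" suggestion and the example without spans are kept.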

However, filtering like this also means that you won't get to annotate those examples at all. From what you describe, it sounds like your model is a bit "lost" and possibly doesn't get to see enough positive examples, so it starts suggesting a lot of very random tokens over and over again. Are you able to add more patterns to help bootstrap the suggestions? Alternatively, it's also possible that your use case just needs the model to be pre-trained more before you can start annotating with the model in the loop. So you might want to experiment with doing some manual annotation first, so the model knows at least something about the entity type.
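On the patterns suggestion: as a rough sketch, a patterns file is a JSONL file with one entry per line, each with a "label" and a token-based "pattern". The snippet below generates such lines from a list of seed terms. The "PRODUCT" label and the seed terms are placeholders I made up, so swap in your own.

```python
import json

label = "PRODUCT"  # hypothetical label, replace with your entity type
seed_terms = ["Visa", "Apple Pay", "PayPal"]  # hypothetical seed terms


def make_pattern(label, term):
    # One token description per whitespace-separated word,
    # matched on the lowercase form of each token
    return {"label": label, "pattern": [{"lower": word.lower()} for word in term.split()]}


lines = [json.dumps(make_pattern(label, term)) for term in seed_terms]
patterns_jsonl = "\n".join(lines) + "\n"
print(patterns_jsonl)
```

You could then save the output as a .jsonl file and pass it to the recipe via the --patterns argument. Note that splitting on whitespace is a simplification: terms containing punctuation may be tokenized differently by the model's tokenizer.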

The model actually has a somewhat weird behavior, where it makes a couple of very accurate (or at least sensible) predictions before getting caught up with a nonsense annotation (such as “with” or “in” or punctuation) and asking about 10 of them in a row (often the exact same word/symbol).

You are right about it not having enough initial data. I’m now annotating a substantial amount manually and will see if it works better.