Extended Pattern Lookup

Hi Ines and Matt,

the following pattern should match e.g. “10115 Berlin”

pattern = [{"IS_DIGIT": True, "LENGTH": 5}, {"ENT_TYPE": "CITY"}]

if there has been an NER step further up in the pipeline, recognizing “Berlin” as a CITY entity.

Question: Is it possible to match only the number while still requiring that it should be followed by a certain entity?

I know that you can do things like that in regex (only match pattern X if it is followed/preceded by pattern Y), but I explicitly want to look ahead (or behind) at a token attribute.

I want to use this to boost my manual annotation with rule-based suggestions, and also to use these rules with an entity ruler in the finished model to “preset” some easy-to-recognize entities, so that the statistical model that follows gets higher accuracy.

Thanks for your help!

Unfortunately no, we don’t currently have support for “lookaround” patterns in the Matcher. It’s an often requested feature we’d like to implement eventually: https://github.com/explosion/spaCy/issues/2262

One way to implement your own component for this would be to use the matcher to assign a specific label, and then narrow down the span afterwards. For instance, if your entity ruler assigns EXTENDED_CITY, create a new CITY span that starts at ent.start + 1, i.e. one token further in (dropping the post code):

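from spacy.tokens import Span

def adjust_entity_spans(doc):
    # Rebuild doc.ents, narrowing every EXTENDED_CITY entity to a CITY span.
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "EXTENDED_CITY":
            # Skip the first token (the post code) and keep the rest.
            new_ent = Span(doc, ent.start + 1, ent.end, label="CITY")
            new_ents.append(new_ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# spaCy v2 style: the function is passed directly. In v3, register it with
# @Language.component and pass its name instead.
nlp.add_pipe(adjust_entity_spans, after="entity_ruler")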

Just note that if your patterns rely on entities predicted by the statistical model, this approach will only work during the annotation phase. At runtime, the entity ruler can't use the statistical model's predictions and also run before the statistical named entity recognizer in the pipeline.
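
Since the original question was about keeping only the number, here is a rough sketch of the same narrowing step in the other direction. It deliberately doesn't use spaCy: spans are plain (start, end, label) tuples with an exclusive end index, and both POSTAL_CODE and split_extended_city are made-up names for illustration, so treat it as the index arithmetic only, not as a drop-in component.

```python
# Library-free sketch: spans are (start, end, label) tuples, end exclusive.
# POSTAL_CODE is a hypothetical label name, not something spaCy defines.
def split_extended_city(spans):
    """Split each EXTENDED_CITY span into a post code span and a city span."""
    result = []
    for start, end, label in spans:
        if label == "EXTENDED_CITY":
            # The first token is the post code, the remaining tokens the city.
            result.append((start, start + 1, "POSTAL_CODE"))
            result.append((start + 1, end, "CITY"))
        else:
            result.append((start, end, label))
    return result

# "Send it to 10115 Berlin": tokens 3 and 4 were matched as EXTENDED_CITY.
print(split_extended_city([(3, 5, "EXTENDED_CITY")]))
# [(3, 4, 'POSTAL_CODE'), (4, 5, 'CITY')]
```

In a real component you would build Span objects from these offsets, the same way adjust_entity_spans does above.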

Thanks Matt and Ines for your fast replies! I will definitely give Ines' entity filter a go.

My impression was that I could use, for example, two entity rulers in my pipeline:
stream -> ... -> entity ruler 1 (dealing with cities with a dictionary) -> entity ruler 2 (dealing with postal codes, using your span correction approach) -> maybe other rulers -> statistical model

Is this correct? As mentioned before, I can boost the annotation with this, but shouldn't it also enhance my statistical model by 'predefining' some entities I can clearly recognize as such?

Ahhhh, I think I misunderstood what you're trying to do, sorry. Your plan is to use the ENT_TYPE assigned by a previous entity ruler (and not a statistical model) in the subsequent entity ruler component, right? That's actually a pretty clever idea and I haven't seen this done before! So I'd definitely be curious to hear how it goes and how much you're able to boost your results like this :slightly_smiling_face:
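
To make the ordering concrete, here is a minimal sketch of two chained rule passes, where the second pass reads the labels written by the first. It is plain Python rather than spaCy, and CITIES, tag_cities, tag_postal_codes and the POSTAL_CODE label are all hypothetical names chosen for the example.

```python
# Library-free sketch of "ruler 1 -> ruler 2": each pass is a plain function.
CITIES = {"Berlin", "Hamburg"}  # hypothetical dictionary for the first pass

def tag_cities(words):
    """First pass: label every word found in the city dictionary."""
    return ["CITY" if word in CITIES else "" for word in words]

def tag_postal_codes(words, labels):
    """Second pass: label a 5-digit word only if the next word is a CITY."""
    labels = list(labels)  # copy, so the first pass's output stays untouched
    for i in range(len(words) - 1):
        looks_like_code = words[i].isdigit() and len(words[i]) == 5
        if looks_like_code and labels[i + 1] == "CITY":
            labels[i] = "POSTAL_CODE"
    return labels

words = "Send it to 10115 Berlin before 20095 arrives".split()
labels = tag_postal_codes(words, tag_cities(words))
print(list(zip(words, labels)))
```

Here "10115" gets a label because a CITY follows it, while "20095" stays unlabeled. The order matters: if the postal code pass ran first, it would never see a CITY label to look ahead to.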

Exactly, that is the plan. I can identify some of my entities (like the postal code) with high accuracy if I already have an existing CITY entity. For now, I'm only sure that this will boost the annotation; I hope that the model will learn the pattern even for unknown cities.

As always, thank you for the fast reply, I'll keep you updated on my endeavors :wink: