Extended Pattern Lookup

k.schroeer · July 22, 2019, 7:27am

Hi Ines and Matt,

the following pattern should match e.g. “10115 Berlin”

pattern = [{"IS_DIGIT": True, "LENGTH": 5}, {"ENT_TYPE": "CITY"}]

if there has been an NER step further up in the pipeline, recognizing “Berlin” as a CITY entity.

Question: Is it possible to match only the number while still requiring that it be followed by a certain entity?

I know that you can do things like that in regex (only match pattern X if it is followed/preceded by pattern Y), but I explicitly want to look ahead (or behind) at a token attribute.

I want to use this to boost my manual annotation with rule-based suggestions, and also to use these rules with an entity ruler in the finished model to “preset” some easy-to-recognize entities and get higher accuracy in the statistical model that follows.

Thanks for your help!

honnibal · July 22, 2019, 4:32pm

Unfortunately no, we don’t currently have support for “lookaround” patterns in the Matcher. It’s an often requested feature we’d like to implement eventually: https://github.com/explosion/spaCy/issues/2262
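
In the meantime, the lookahead check can be done by hand once the tokens carry their entity types. Here is a library-free sketch of the logic only: the (text, ent_type) tuples and the function name match_with_lookahead are stand-ins for illustration, not spaCy API.

```python
def match_with_lookahead(tokens):
    """Return indices of 5-digit tokens that are directly followed by a CITY token.

    `tokens` is a list of (text, ent_type) tuples, a hypothetical stand-in
    for spaCy tokens and their ent_type_ attribute.
    """
    hits = []
    # Stop one token early so that tokens[i + 1] always exists
    for i in range(len(tokens) - 1):
        text, _ = tokens[i]
        _, next_ent_type = tokens[i + 1]
        if text.isdigit() and len(text) == 5 and next_ent_type == "CITY":
            # Only the number is reported; the CITY token is just a condition
            hits.append(i)
    return hits


tokens = [("Ich", ""), ("wohne", ""), ("in", ""),
          ("10115", ""), ("Berlin", "CITY"), (".", "")]
hits = match_with_lookahead(tokens)
```

Only the index of "10115" ends up in hits; a five-digit number that is not followed by a CITY token is ignored.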

ines · July 22, 2019, 4:55pm

One way to implement your own component for this would be to use the matcher to assign a specific label, and then narrow down the span afterwards. For example, if your entity ruler assigns EXTENDED_CITY, you could create a new CITY span that starts at ent.start + 1, i.e. one token further (dropping the post code):

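from spacy.tokens import Span

def adjust_entity_spans(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "EXTENDED_CITY":
            # Skip the first token (the post code) and relabel the rest as CITY
            new_ent = Span(doc, ent.start + 1, ent.end, label="CITY")
            new_ents.append(new_ent)
        else:
            # Keep all other entities unchanged
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Run the component right after the entity ruler
nlp.add_pipe(adjust_entity_spans, after="entity_ruler")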

Just note that the approach outlined here will only work during the annotation phase. At runtime, the entity ruler can't use the statistical model's predictions if it also runs before the statistical named entity recognizer in the pipeline.

k.schroeer · July 23, 2019, 6:48am

Thanks Matt and Ines for your fast replies! I will definitely give Ines' entity filter a go.

My impression was that I could use, for example, two entity rulers in my pipeline:
stream -> ... -> entity ruler 1 (dealing with cities with a dictionary) -> entity ruler 2 (dealing with postal codes, using your span correction approach) -> maybe other rulers -> statistical model

Is this correct? As mentioned before I can boost the annotation with this, but shouldn't this enhance my statistical model by 'predefining' some entities I can clearly recognize as such?

ines · July 23, 2019, 10:20am

Ahhhh, I think I misunderstood what you're trying to do, sorry. Your plan is to use the ENT_TYPE assigned by a previous entity ruler (and not a statistical model) in the subsequent entity ruler component, right? That's actually a pretty clever idea and I haven't seen this done before! So I'd definitely be curious to hear how it goes and how much you're able to boost your results like this.
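
For reference, a rough sketch of what such a chained setup could look like. It assumes spaCy v3's string-based add_pipe API, which this thread predates. The component names, the POSTCODE_CITY and POSTCODE labels and the one-entry city "dictionary" are all made up for illustration. The second ruler is given overwrite_ents here on the assumption that its match would otherwise be skipped, because it overlaps the CITY entity set by the first ruler.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span


@Language.component("split_postcode_city")  # hypothetical component name
def split_postcode_city(doc):
    """Split each POSTCODE_CITY span into a POSTCODE and a CITY entity."""
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "POSTCODE_CITY":
            # First token is the post code, the remainder is the city
            new_ents.append(Span(doc, ent.start, ent.start + 1, label="POSTCODE"))
            new_ents.append(Span(doc, ent.start + 1, ent.end, label="CITY"))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc


nlp = spacy.blank("de")

# Ruler 1: dictionary-style lookup of known cities
city_ruler = nlp.add_pipe("entity_ruler", name="city_ruler")
city_ruler.add_patterns([{"label": "CITY", "pattern": "Berlin"}])

# Ruler 2: reuses the ENT_TYPE set by ruler 1; overwrite_ents lets the
# wider match replace the overlapping CITY entity
postcode_ruler = nlp.add_pipe(
    "entity_ruler", name="postcode_ruler", config={"overwrite_ents": True}
)
postcode_ruler.add_patterns([
    {"label": "POSTCODE_CITY",
     "pattern": [{"IS_DIGIT": True, "LENGTH": 5}, {"ENT_TYPE": "CITY"}]}
])

# Span correction runs right after the second ruler
nlp.add_pipe("split_postcode_city", after="postcode_ruler")

doc = nlp("Ich wohne in 10115 Berlin.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

A statistical NER component would then be added after these three. Note that the single {"ENT_TYPE": "CITY"} token in the pattern only covers one-token city names.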

k.schroeer · July 23, 2019, 10:45am

Exactly, that is the plan. I can identify some of my entities (like the postal code) with high accuracy if I already have an existing CITY entity. For now, I'm only sure that this will boost the annotation; I hope that the model will learn the pattern even for unknown cities.

As always, thank you for the fast reply. I'll keep you updated on my endeavors.
