Pattern Matching on Custom Attributes

Hi There,

I'm trying to create a pipeline component which identifies terms that describe an "outgroup". In the following sentence, for example, an "outgroup" would be "Taliban regime". The outgroup in this instance is based on a named entity (Taliban) and a named concept (regime).

On my orders, the United States military has begun strikes against Al Qaeda terrorist training camps and military installations of the Taliban regime in Afghanistan.

The pipeline has a component called concept recognition, which uses a markup schema to annotate tokens and spans with a custom attribute and creates a doc extension called "named_concepts".

I am now trying to create a final pipeline component for identifying outgroups based on a pattern combining "ENT_TYPE" and the custom attribute. All the pipeline components have been tested and are working as expected; however, I can't seem to get the pattern matching to work.

The problem seems to be in writing the correct pattern. Would you be able to let me know where I'm going wrong, please?

Code below:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.tokens import Doc

import pipeline # module for custom pipeline components

nlp = spacy.load("en_core_web_sm")

for component in nlp.pipe_names:
    if component not in ['tagger', "parser", "ner"]:
        nlp.remove_pipe(component)

# add named entity matcher component to pipeline
nlp.add_pipe(pipeline.EntityMatcher(nlp), before = "ner") # top up on named entities

# add merge entities
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents, after = "ner")

# add concept matcher component to pipeline
nlp.add_pipe(pipeline.ConceptMatcher(nlp), after = "merge_entities") # add concepts

print(nlp.pipe_names)
# ['tagger', 'parser', 'Named Entity Matcher', 'ner', 'merge_entities', 'Concept Matcher']

class group_id(object):

    name = "group id"

    GROUP = ["NORP", "GPE", "ORG", "PERSON"]

    def __init__(self, nlp):
    
        self.nlp = nlp
    
        Doc.set_extension("outgroup_entities", default = [], force = True)
    
        self.outgroups = Matcher(nlp.vocab)
    
        self.outgroups.add("OUTGROUP", None,
                       
                        # this pattern works 
                       [{'ENT_TYPE': {"IN" : group_id.GROUP}}])
    
                        # none of these patterns work
                        #[{'ENT_TYPE': {"IN" : group_id.GROUP}}, {"_" : {"ATTRIBUTE" : {"IN" : ["outgroup"]}}}])
                        #[{'ENT_TYPE': {"IN" : group_id.GROUP}}, {"_" : {"ATTRIBUTE" : "outgroup"}}])
                        #[{"_" : {"ATTRIBUTE" : "outgroup"}}])
                        #[{"_" : {"ATTRIBUTE" : {"IN" : ["outgroup"]}}}])
                                                 
    
    def __call__(self, doc):
    
        # prints correct output confirming named entities extension is working
        # named entities:  [the United States, Al Qaeda, Taliban, Afghanistan]
        print("named entities: ", [ent for ent in doc.ents])
    
        # prints correct output confirming named concepts extension is working
        # outgroup concepts:  [terrorist, regime]
        print("outgroup concepts: ", [concept for concept in doc._.named_concepts if concept._.ATTRIBUTE == "outgroup"]) 
    
        with doc.retokenize() as retokenizer:
            matches = self.outgroups(doc)
        
            for match_id, start, end in matches:
                span = Span(doc, start, end)
            
                #returns results for the "ENT_TYPE" pattern but not for patterns trying to access custom attribute
                print(self.nlp.vocab.strings[match_id], start, end, span.text)

                doc._.outgroup_entities = list(doc._.outgroup_entities) + [span]
            
        return doc
    
if "group id" in nlp.pipe_names:
    nlp.remove_pipe("group id")

nlp.add_pipe(group_id(nlp), last = True)

text = ("On my orders, the United States military has begun strikes against Al Qaeda "
        "terrorist training camps and military installations of the Taliban regime in Afghanistan.")
output = nlp(text)
print(output._.outgroup_entities)
# ENT_TYPE pattern only returns: [the United States, Al Qaeda, Taliban, Afghanistan]
# when custom attribute pattern used returns empty list

Now resolved, and the problem is a bit embarrassing!

Posting for completeness as someone else may encounter the same situation.
TL;DR: the pipeline component was annotating custom attributes on the Span() objects and not on the Token() objects within the Doc().

The problem stems from the matcher component I was using to annotate custom attributes. The code was taken from the online examples. It may be worth updating the docs to explain how Token() objects are annotated within the Doc() object for these examples.

The implementation of the matcher is as follows:

def __call__(self, doc):
    """Apply the pipeline component on a Doc object and modify it if matches are found.
    Return the Doc, so it can be processed by the next component in the pipeline, if available.

    merge entities code: https://github.com/explosion/spaCy/issues/4107
    filter code: https://github.com/explosion/spaCy/issues/4056
    """
    with doc.retokenize() as retokenizer:

        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end)
            concept_id = self.nlp.vocab.strings[match_id]
            # annotate every token in the matched span
            for token in span:
                token._.CONCEPT = concept_id
                token._.IDEOLOGY = self.get_ideology(concept_id)
                token._.ATTRIBUTE = self.get_attribute(concept_id)
            try:
                if len(span) > 1:
                    retokenizer.merge(span)
            except ValueError:
                pass
            doc._.concepts = list(doc._.concepts) + [span]

    return doc

My faulty code is as follows:

(from __init__())
Span.set_extension("CONCEPT", default = '', force = True)
Token.set_extension("CONCEPT", default = '', force = True)

Span.set_extension("ATTRIBUTE", default = '', force = True)
Token.set_extension("ATTRIBUTE", default = '', force = True)

Span.set_extension("IDEOLOGY", default = '', force = True)
Token.set_extension("IDEOLOGY", default = '', force = True)

(from __call__())
matches = self.matcher(doc)
spans = []  # keep the spans for later so we can merge them afterwards

for match_id, start, end in matches:

    concept = Span(doc, start, end)
    concept._.CONCEPT = doc.vocab.strings[match_id]
    concept._.IDEOLOGY = self.get_ideology(concept._.CONCEPT)
    concept._.ATTRIBUTE = self.get_attribute(concept._.CONCEPT)
        
    doc._.named_concepts = spacy.util.filter_spans(list(doc._.named_concepts) + [concept])
            
return doc

You will see from this code that the custom attributes were set on the Span() and added to doc._.named_concepts. The custom attributes were not set on each Token() within the Doc() object, which is why they were not being detected.
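To make the difference concrete, here is a minimal sketch (not the original pipeline) showing that a Matcher pattern using "_" only sees token-level values. It assumes a blank English pipeline, a hand-picked span standing in for the concept matcher's output, and the list-based `matcher.add(key, [pattern])` signature of newer spaCy releases rather than the `add(key, None, pattern)` form used in the question.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, Token

# Register the attribute on both levels, as in the faulty component
Span.set_extension("ATTRIBUTE", default="", force=True)
Token.set_extension("ATTRIBUTE", default="", force=True)

nlp = spacy.blank("en")  # tokenizer only, no model download needed
doc = nlp("military installations of the Taliban regime in Afghanistan")

matcher = Matcher(nlp.vocab)
# Newer add() signature: a list of patterns (an assumption about the installed version)
matcher.add("OUTGROUP_CONCEPT", [[{"_": {"ATTRIBUTE": "outgroup"}}]])

# Stand-in for the concept matcher output: the single-token span "regime"
span = doc[5:6]

# 1) Annotate the Span only: the Matcher reads token._.ATTRIBUTE, which is still ""
span._.ATTRIBUTE = "outgroup"
span_only_matches = [(start, end) for _, start, end in matcher(doc)]

# 2) Annotate each Token in the span: now the pattern can match
for token in span:
    token._.ATTRIBUTE = "outgroup"
token_level_matches = [(start, end) for _, start, end in matcher(doc)]

print("span only:  ", span_only_matches)
print("token level:", token_level_matches)
```

The span-level value lives in the Doc's user data keyed by the span boundaries, so it is never consulted when the Matcher looks up the attribute token by token.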

The problem was initially solved by creating three lookup tables and adding getter functions to the Token(). However, this adds significant time to processing the doc:

Token.set_extension("CONCEPT", getter=get_concept, force = True)
Token.set_extension("ATTRIBUTE", getter=get_attribute, force = True)
Token.set_extension("IDEOLOGY", getter=get_ideology, force = True)
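For anyone wanting to try the getter route, a rough sketch follows. The lookup tables themselves are not shown above, so this version is an assumption: the hypothetical `get_attribute` getter simply scans `doc._.named_concepts` for a span containing the token. It also hints at the cost, since the getter runs for every token each time the Matcher is called.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token


def get_attribute(token):
    """Hypothetical getter: ATTRIBUTE of the first concept span containing the token."""
    for concept in token.doc._.named_concepts:
        if concept.start <= token.i < concept.end:
            return concept._.ATTRIBUTE
    return ""


Doc.set_extension("named_concepts", default=[], force=True)
Span.set_extension("ATTRIBUTE", default="", force=True)
# Token values are computed on access instead of being stored
Token.set_extension("ATTRIBUTE", getter=get_attribute, force=True)

nlp = spacy.blank("en")
doc = nlp("military installations of the Taliban regime in Afghanistan")

# Stand-in for the concept matcher: annotate the Span only
concept = doc[5:6]
concept._.ATTRIBUTE = "outgroup"
doc._.named_concepts = [concept]

matcher = Matcher(nlp.vocab)
matcher.add("OUTGROUP_CONCEPT", [[{"_": {"ATTRIBUTE": "outgroup"}}]])

getter_matches = [(start, end) for _, start, end in matcher(doc)]
print(getter_matches)
```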

I will implement a new pipeline component that annotates at the token level.
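A possible shape for that component, as a sketch only: `annotate_concept_tokens` is a hypothetical name, the entities and the concept span are set by hand to stand in for the NER and concept matcher output, and the list-based `matcher.add` signature of newer spaCy releases is assumed. With the attribute stored on the tokens, the combined pattern from the question can match.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc, Span, Token

GROUP = ["NORP", "GPE", "ORG", "PERSON"]

Doc.set_extension("named_concepts", default=[], force=True)
Span.set_extension("ATTRIBUTE", default="", force=True)
Token.set_extension("ATTRIBUTE", default="", force=True)


def annotate_concept_tokens(doc):
    """Hypothetical component: copy each concept span's ATTRIBUTE onto its tokens."""
    for concept in doc._.named_concepts:
        for token in concept:
            token._.ATTRIBUTE = concept._.ATTRIBUTE
    return doc


nlp = spacy.blank("en")
doc = nlp("military installations of the Taliban regime in Afghanistan")

# Stand-ins for the NER and concept matcher output (set by hand for this sketch)
doc.ents = [Span(doc, 4, 5, label="NORP"), Span(doc, 7, 8, label="GPE")]
concept = doc[5:6]
concept._.ATTRIBUTE = "outgroup"
doc._.named_concepts = [concept]

doc = annotate_concept_tokens(doc)

matcher = Matcher(nlp.vocab)
# Entity token followed by a token whose custom attribute is "outgroup"
matcher.add("OUTGROUP", [[
    {"ENT_TYPE": {"IN": GROUP}},
    {"_": {"ATTRIBUTE": "outgroup"}},
]])

outgroups = [doc[start:end].text for _, start, end in matcher(doc)]
print(outgroups)
```

Storing the value once per token avoids recomputing it in a getter on every Matcher call, at the price of keeping the Span and Token annotations in sync.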