OP + does not appear to be greedy

wpm · January 2, 2018, 3:56pm

I am trying to match spans of one or more “red” or “blue” tokens as entities called COLOR. So in the sentence

the red red red cat chased the blue blue mouse

I want to see two COLOR entities: “red red red” and “blue blue”. The number of tokens in a span is variable, so I need to use a pattern quantifier.

I wrote this in my notebook

import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en")
patterns = [
    [{"LOWER": "red", "OP": "+"}],
    [{"LOWER": "blue", "OP": "+"}]
]
matcher = Matcher(nlp.vocab)
matcher.add("COLOR", None, *patterns)

text = "the red red red cat chased the blue blue mouse"
document = nlp(text)
for match_id, start, end in matcher(document):
    document.ents += ((match_id, start, end),)
displacy.render(document, style="ent", jupyter=True)

It produces output that is not what I want: the entity label is applied to each individual token instead of to multi-token spans. The + quantifier does not appear to be greedy. I get the same result if I omit the "OP": "+" quantifiers from the patterns.

Am I misunderstanding the way spaCy pattern matching quantifiers work?
How do I get the multi-token span annotations that I am looking for?

honnibal · January 8, 2018, 6:10pm

I think this is related to the issue here: https://github.com/explosion/spaCy/issues/1503

wpm · January 8, 2018, 6:12pm

Yep. Different quantifier, but the same behavior.
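
For anyone who hits the same per-token behavior, one possible workaround is to post-process the matcher output and merge matches that overlap or touch. The following is only a minimal sketch, not the fix from the linked issue. The helper name `merge_adjacent_matches` and the sample match list are hypothetical; the list assumes the matcher returned one match per color token, as described in the question. Because both patterns share the COLOR label, a "red" run directly followed by a "blue" run would also be merged into one span.

```python
def merge_adjacent_matches(matches):
    """Merge (match_id, start, end) triples that overlap or touch
    and share the same match_id, keeping the longest combined span."""
    merged = []
    # Sort by start, then end, so neighbouring matches are visited in order.
    for match_id, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
        if merged and merged[-1][0] == match_id and start <= merged[-1][2]:
            # This match begins inside or right at the end of the previous one.
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            merged.append((match_id, start, end))
    return merged


# Hypothetical per-token matches for
# "the red red red cat chased the blue blue mouse"
COLOR = 1  # stand-in for the match ID that the matcher would return
per_token = [
    (COLOR, 1, 2), (COLOR, 2, 3), (COLOR, 3, 4),  # red red red
    (COLOR, 7, 8), (COLOR, 8, 9),                 # blue blue
]
spans = merge_adjacent_matches(per_token)
print(spans)  # [(1, 1, 4), (1, 7, 9)]
```

The merged triples could then be added to document.ents in place of the raw matches, in the same way as the loop in the question.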
