OP + does not appear to be greedy


(W.P. McNeill) #1

I am trying to match spans of one or more “red” or “blue” tokens as entities called COLOR. So in the sentence

the red red red cat chased the blue blue mouse

I want to see two COLOR entities: “red red red” and “blue blue”. The number of tokens in a span is variable, so I need to use a pattern quantifier.

I wrote this in my notebook

import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en")
patterns = [
    [{"LOWER": "red", "OP": "+"}],
    [{"LOWER": "blue", "OP": "+"}]
]
matcher = Matcher(nlp.vocab)
matcher.add("COLOR", None, *patterns)

text = "the red red red cat chased the blue blue mouse"
document = nlp(text)
for match_id, start, end in matcher(document):
    document.ents += ((match_id, start, end),)
displacy.render(document, style="ent", jupyter=True)

It produces output in which every "red" and "blue" token is labeled as its own COLOR entity.

This is not what I want: the entity label is applied to each individual token instead of to multi-token spans. The + quantifier does not appear to be greedy. I get the same result if I omit the "OP": "+" quantifiers from the patterns.

  • Am I misunderstanding the way spaCy pattern matching quantifiers work?
  • How do I get the multi-token span annotations that I am looking for?

(Matthew Honnibal) #2

I think this is related to the issue here: https://github.com/explosion/spaCy/issues/1503

(W.P. McNeill) #3

Yep. Different quantifier, but the same behavior.
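
One possible workaround, independent of how the matcher itself behaves, is to post-process the matches before assigning them to document.ents. The sketch below is plain Python and does not call spaCy. The helper name merge_matches is hypothetical, not part of the spaCy API. It assumes the matcher yields (match_id, start, end) tuples of token offsets, as in the loop above, and it merges any matches with the same match_id that overlap or touch.

```python
def merge_matches(matches):
    """Merge overlapping or adjacent (match_id, start, end) tuples.

    Hypothetical helper, not part of spaCy. Two matches are merged only
    when they share a match_id and the second starts at or before the
    end of the first.
    """
    merged = []
    # Sort by start offset, then end offset, so neighbours are consecutive.
    for match_id, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
        if merged and merged[-1][0] == match_id and start <= merged[-1][2]:
            prev_id, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_id, prev_start, max(prev_end, end))
        else:
            merged.append((match_id, start, end))
    return merged


# Token offsets for "the red red red cat chased the blue blue mouse",
# with 1 standing in for the COLOR match_id (an assumption for the demo).
single_token_matches = [(1, 1, 2), (1, 2, 3), (1, 3, 4), (1, 7, 8), (1, 8, 9)]
print(merge_matches(single_token_matches))  # [(1, 1, 4), (1, 7, 9)]
```

The merged tuples could then replace the raw matches in the loop that extends document.ents. Note that because both patterns share the COLOR label, a "red" directly followed by a "blue" would also be merged into a single span. If that is not wanted, the two patterns would need separate match IDs.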