I am trying to match spans of one or more “red” or “blue” tokens as entities called COLOR. So in the sentence
the red red red cat chased the blue blue mouse
I want to see two COLOR entities: “red red red” and “blue blue”. The number of tokens in a span is variable, so I need to use a pattern quantifier.
I wrote this in my notebook:
import spacy
from spacy import displacy
from spacy.matcher import Matcher
nlp = spacy.load("en")
patterns = [
    [{"LOWER": "red", "OP": "+"}],
    [{"LOWER": "blue", "OP": "+"}]
]
matcher = Matcher(nlp.vocab)
matcher.add("COLOR", None, *patterns)
text = "the red red red cat chased the blue blue mouse"
document = nlp(text)
for match_id, start, end in matcher(document):
    document.ents += ((match_id, start, end),)
displacy.render(document, style="ent", jupyter=True)
The rendered output is not what I want: the entity label is applied to each individual token instead of to multi-token spans. The + quantifier does not appear to be greedy. I get the same result if I omit the "OP": "+" quantifiers from the patterns.
- Am I misunderstanding the way spaCy pattern matching quantifiers work?
- How do I get the multi-token span annotations that I am looking for?
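To make the desired behaviour concrete, here is a minimal pure-Python sketch of the longest-span filtering I am after. It does not use spaCy at all. The `candidates` list is a hypothetical stand-in for what I assume the matcher reports with "OP": "+" (every run of the same colour token, including all shorter sub-runs), and `longest_spans` is a helper name I made up for illustration. As far as I can tell, `spacy.util.filter_spans` and, in newer spaCy versions, the `greedy="LONGEST"` argument to `Matcher.add` are meant to do something similar, but I may be misreading the docs.

```python
def longest_spans(matches):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Prefer longer spans; break ties by earlier start.
    ordered = sorted(matches, key=lambda m: (-(m[1] - m[0]), m[0]))
    kept, used = [], set()
    for start, end in ordered:
        # Skip any span that shares a token with one already kept.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)


tokens = "the red red red cat chased the blue blue mouse".split()

# Hypothetical stand-in for the matcher output (an assumption, not real
# spaCy output): every run of identical colour tokens, sub-runs included.
candidates = [
    (i, j)
    for i in range(len(tokens))
    for j in range(i + 1, len(tokens) + 1)
    if tokens[i] in {"red", "blue"} and all(t == tokens[i] for t in tokens[i:j])
]

color_spans = longest_spans(candidates)
color_texts = [" ".join(tokens[s:e]) for s, e in color_spans]
print(color_texts)
```

The two spans this keeps, token offsets (1, 4) and (7, 9), are the ones I would like to end up in document.ents.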