PhraseMatcher only takes phrases of fewer than 10 words

ner
spacy
solved

(Abhinandan Srivastava) #1

When I add patterns using PhraseMatcher:

matcher.add('ORG', None, *[nlp(text) for text in Organisation])

If a phrase such as United States of America comes up, it throws an error. I think it only takes phrases of fewer than 10 words.

Spacy==2.0.18


(Abhinandan Srivastava) #2

ValueError: [T002] Pattern length (10) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.


(Ines Montani) #3

ValueError: [T002] Pattern length (10) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.

Yes, that’s correct: spaCy’s current PhraseMatcher implementation has this limit. In the upcoming version v2.1.0, the matcher engine has been rewritten and phrase patterns won’t be limited to 10 tokens anymore.

In the meantime, you can always use the regular Matcher and create token-based patterns instead:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
for doc in nlp.pipe(Organisation):
    # case-insensitive pattern
    pattern = [{'lower': token.lower_} for token in doc]
    # case-sensitive alternative:
    # pattern = [{'orth': token.text} for token in doc]
    matcher.add('ORG', None, pattern)
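
To decide which phrases actually need this workaround, something like the following rough sketch might help. It uses plain Python only. The helper names split_by_limit and token_pattern are hypothetical, and it counts whitespace-separated words, which is only an approximation of spaCy's tokenizer (punctuation can produce extra tokens). The limit of 10 is taken from the T002 error message above, and the pattern keys are written as in the reply.

```python
MAX_LENGTH = 10  # taken from the T002 error message (spaCy 2.0.x)


def split_by_limit(phrases, max_length=MAX_LENGTH):
    """Separate phrases that fit the PhraseMatcher limit from those that do not.

    Assumption: words are counted by whitespace splitting, which only
    approximates spaCy's tokenizer.
    """
    short, too_long = [], []
    for phrase in phrases:
        # The error is raised when pattern length >= max_length.
        if len(phrase.split()) >= max_length:
            too_long.append(phrase)
        else:
            short.append(phrase)
    return short, too_long


def token_pattern(words, case_sensitive=False):
    """Build one token-based pattern (a list of dicts) for the regular Matcher."""
    if case_sensitive:
        return [{'orth': word} for word in words]
    return [{'lower': word.lower()} for word in words]


phrases = ["United States of America", "a b c d e f g h i j"]
short, too_long = split_by_limit(phrases)
patterns = [token_pattern(phrase.split()) for phrase in too_long]
```

The phrases in short could stay with the PhraseMatcher, while each entry in patterns would be added to the regular Matcher. With real data, the token lists should come from nlp.pipe rather than str.split, so the patterns line up with how spaCy tokenizes the text.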

(Ines Montani) #4

A post was split to a new topic: Converting data to Prodigy’s format