Pattern length (12) >= phrase_matcher.max_length

I'm trying to bootstrap a model by loading some phrase patterns for NER (a new label for internal memo titles, similar to work-of-art titles).

I created patterns like:

{"label":"MEMO", "pattern":"DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY"}

I get the error

[T002] Pattern length (12) >= phrase_matcher.max_length (10). Length can be set on initialization, up to 10.

Is it possible to change the max length without writing an entire new custom recipe?

My approach might also simply be wrong… I have a hard time getting any training going, as all that Prodigy gives me to accept/reject are single words.

I tried with patterns like this:

pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'OP': '+'}]

The problem is that it captures only the first word after the dash. I'm not sure how to describe the title's end boundary.
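Part of what makes the end boundary hard to describe is that an unanchored `{'OP': '+'}` makes the Matcher return every possible end position as its own candidate span. A minimal sketch of that behaviour, assuming the list-style `matcher.add` API and using `spacy.util.filter_spans` to keep only the longest candidate (the example sentence and names are made up):

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# A blank pipeline is enough here: ORTH is a lexical attribute, no model needed
nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)

# An open-ended '+' wildcard after DIRECTIVE: every possible end token
# produces its own match candidate
pattern = [{'ORTH': 'DIRECTIVE'}, {'OP': '+'}]
matcher.add('MEMO', [pattern])

doc = nlp("DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY")
spans = [doc[start:end] for _, start, end in matcher(doc)]
print(len(spans))  # one candidate per possible end position

# filter_spans keeps the longest non-overlapping span
longest = filter_spans(spans)
print(longest[0].text)
```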

So I tried creating a bunch of sample patterns but they are too long!

I'm using https://explosion.ai/demos/matcher to quickly play around with patterns… If there is a boundary word at the end of the title (like the word "program"), then it is easy.

pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'OP': '+', 'IS_ASCII': True},
           {'ORTH': 'PROGRAM'}]

I was able to build a few sample patterns for those typical memo titles that have an end boundary word.
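For reference, here's how one of those boundary-word patterns behaves end to end. This is a hedged sketch with a made-up sentence; a blank pipeline suffices because ORTH, LIKE_NUM, IS_PUNCT and IS_ASCII are all lexical attributes that don't need a trained model:

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'OP': '+', 'IS_ASCII': True},
           {'ORTH': 'PROGRAM'}]  # the boundary word anchors the end
matcher.add('MEMO', [pattern])

doc = nlp("DIRECTIVE 2 - THE NEW SAFETY PROGRAM starts next Monday")
spans = [doc[start:end] for _, start, end in matcher(doc)]
# the optional tokens can create overlapping candidates; keep the longest
for span in filter_spans(spans):
    print(span.text)
```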

For those without, I tried the following:

pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'OP': '+', 'IS_ASCII': True},
           {'POS': 'VERB', 'OP': '!'}]

A lot of typical sentences have something like:

DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY was sent to all workers today

The rule pattern above, however, does not work. It does not pick up the words before the verb "was".

I can use IS_UPPER as one option, but not everything is written in uppercase; often it's mixed case too.
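For the mixed-case cases, the anchor token at least can be matched case-insensitively by using LOWER instead of ORTH, since LOWER compares against the lowercased token text. A small sketch (the label name and examples are made up):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')
matcher = Matcher(nlp.vocab)
# 'LOWER' compares the lowercased token text, so this anchor matches
# "DIRECTIVE", "Directive" and "directive" alike
pattern = [{'LOWER': 'directive'}, {'LIKE_NUM': True, 'OP': '?'}]
matcher.add('MEMO_START', [pattern])

for text in ("DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY",
             "Directive 6 - What we're going to do today"):
    matches = matcher(nlp(text))
    print(text, '->', len(matches), 'match(es)')
```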

Example in the demo

The problem with the above pattern is that it would only match this very exact string. Patterns can only be explicit – to be able to generalise and find similar occurrences, you usually want to be training a statistical model.

From looking at your examples and how complex the patterns are, I'm also not sure if the MEMO category and approach you're going for makes sense here. The phrases you're looking for aren't really entities or proper nouns – they're almost complete sentences. So even if you can bootstrap some patterns, the entity recognizer will likely struggle to learn or predict anything meaningful here.

So maybe you should actually phrase this problem differently – for example, as a text classification task, or a combination of NER predictions and rule-based information extraction. If you haven't seen it already, you might find @honnibal's talk on this topic useful:

Starting at around 11:35, it also shows some common NLP problems and different annotation strategies in Prodigy.

Thank you. I can see the complexity, since the titles are sentences within sentences. I think for my immediate use cases, I will be able to get what I want through rule matching instead of a statistical model.

I can write my own code, but it would be nice to use spaCy to do the matching, since it has some understanding of the language and is more powerful than a regex.

Is there a way to create a match rule that matches “up to token X”? As in:

match all uppercased words, up to a lowercase verb? Even matching all uppercased words following X would be fine.

For the immediate need, the titles of internal memos are uppercased. There are some markers for the beginning (like DIRECTIVE 4). If I can have a match rule that selects all uppercase words and punctuation following DIRECTIVE 4 up to the first lowercase word (but not including the first lowercase word), that would be awesome!

Sure – using token rules is definitely a good approach and probably much more effective than a statistical model :+1:

You could use {'IS_UPPER': True, 'OP': '+'} to match one or more uppercase words, end the pattern with {'IS_LOWER': True, 'POS': 'VERB'} (see here in the demo), and then take all matched tokens, minus the last one?

Here's an example in code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'IS_UPPER': True, 'OP': '+'},
           {'POS': 'VERB', 'IS_LOWER': True}]
matcher.add('MEMO', [pattern])

doc = nlp("DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY was sent to all workers today")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]  # the matched span
    span_without_verb = span[:-1]
    print(span_without_verb)
    # do something with the span...

Btw, if your patterns need to use statistical predictions like the part-of-speech tags or dependencies, and you find that they're not perfectly accurate on your data, you could use pos.teach or dep.teach to improve them. This will make your patterns perform even better, and it'll be quick to do, since you only have to give binary feedback.

I never answered! I have been using this approach and it's working pretty well! Thank you.
