My approach might also simply be totally wrong… I'm having a hard time getting any training going, as all that Prodigy is giving me to accept/reject are single words.
Using https://explosion.ai/demos/matcher to play around quickly with patterns… If there is a boundary word at the end of the title (like the word “program”) then it is easy.
The problem with the above pattern is that it would only match this very exact string. Patterns can only be explicit – to be able to generalise and find similar occurrences, you usually want to be training a statistical model.
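To make the "explicit" point concrete, here's a toy illustration in plain Python (not spaCy's actual Matcher, just a stand-in to show the behaviour; the phrase used is hypothetical) of why a token pattern only matches the exact sequence it spells out:

```python
# Toy stand-in for a token pattern: each dict constrains one token.
# (Illustrative only – spaCy's real Matcher supports many more attributes.)
pattern = [{'LOWER': 'nutrition'}, {'LOWER': 'program'}]

def matches(pattern, tokens):
    """Return True if the token sequence satisfies the pattern exactly."""
    if len(tokens) != len(pattern):
        return False
    return all(spec['LOWER'] == tok.lower() for spec, tok in zip(pattern, tokens))

print(matches(pattern, ['Nutrition', 'Program']))    # the exact phrase matches
print(matches(pattern, ['Nutrition', 'Programme']))  # a near-variant doesn't
```

This is exactly the limitation described above: the pattern has no notion of similarity, so any paraphrase or variant falls through, which is what a statistical model is for.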
From looking at your examples and how complex the patterns are, I'm also not sure if the MEMO category and approach you're going for makes sense here. The phrases you're looking for aren't really entities or proper nouns – they're almost complete sentences. So even if you can bootstrap some patterns, the entity recognizer will likely struggle to learn or predict anything meaningful here.
So maybe you should actually phrase this problem differently – for example, as a text classification task, or a combination of NER predictions and rule-based information extraction. If you haven't seen it already, you might find @honnibal's talk on this topic useful:
Starting at around 11:35, it also shows some common NLP problems and different annotation strategies in Prodigy.
Thank you. I can see the complexity, since the titles are sentences within sentences. I think for my immediate use cases, I will be able to get what I want through rule matching instead of a statistical model.
I can write my own code, but it would be nice to use spaCy to do the matching, since it has some understanding of the language – it's more powerful than a regex.
Is there a way to create a match rule that matches “up to token X”? As in:
match all uppercased words, up to a verb in lowercase? Even just matching all uppercased words following X would be fine.
For the immediate need, the titles of internal memos are uppercased. There are some markers for the beginning (like DIRECTIVE 4). If I can have a match rule that selects all uppercase words and punctuation following DIRECTIVE 4 up to the first lowercase word (but not including the first lowercase word), that would be awesome!
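The logic being asked for here – grab the run of uppercase words after the marker and stop before the first lowercase word – can be sketched in plain Python on whitespace tokens (a hypothetical helper, just to pin down the intended behaviour; the real solution below uses spaCy's Matcher):

```python
def extract_title(text, marker='DIRECTIVE'):
    """Collect the marker plus following tokens that are uppercase,
    numeric, or punctuation, stopping at the first lowercase word."""
    tokens = text.split()
    try:
        start = tokens.index(marker)
    except ValueError:
        return None  # marker not found
    title = [tokens[start]]
    for tok in tokens[start + 1:]:
        # keep numbers, punctuation and all-caps tokens; stop at lowercase
        if not any(c.islower() for c in tok):
            title.append(tok)
        else:
            break
    return ' '.join(title)

print(extract_title("DIRECTIVE 4 - SAFETY FIRST was circulated today"))
# → DIRECTIVE 4 - SAFETY FIRST
```

The whitespace tokenizer is the weak point of this sketch – punctuation glued to words would break it, which is one reason to express the same rule over spaCy's tokens instead.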
Sure – using token rules is definitely a good approach, and probably much more effective than a statistical model.
You could use {'IS_UPPER': True, 'OP': '+'} to match one or more uppercase words, end the pattern with {'IS_LOWER': True, 'POS': 'VERB'} (see here in the demo), and then take all matched tokens, minus the last one?
Here's an example in code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'ORTH': 'DIRECTIVE'},
           {'LIKE_NUM': True, 'OP': '?'},
           {'IS_PUNCT': True, 'OP': '?'},
           {'IS_UPPER': True, 'OP': '+'},
           {'POS': 'VERB', 'IS_LOWER': True}]
matcher.add('MEMO', None, pattern)

doc = nlp("DIRECTIVE 6 - WHAT WE'RE GOING TO DO TODAY was sent to all workers today")
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]          # the matched span
    span_without_verb = span[:-1]  # drop the trailing lowercase verb
    print(span_without_verb)
    # do something with the span...
Btw, if your patterns rely on statistical predictions like the part-of-speech tags or dependencies, and you find that those aren't perfectly accurate on your data, you could use pos.teach or dep.teach to improve them. This will make your patterns perform even better, and it'll be quick to do, since you only have to give binary feedback.