Lookaround with Prodigy / spaCy matcher semantics

I want to identify times in a dialogue. So the natural text might be something like:

"I will meet you at 5"

I want to add a heuristic to a pattern.jsonl file that annotates the tokens that come after the word 'at', but not the word 'at' itself. So the following won't work:

{"label": "time", "pattern": [{"LOWER": {"IN": ["at"]}}, {"LOWER": {"REGEX": "\\d"}}]}

since it will match

"at 5"

rather than just

"5"

With normal regular expressions I could use lookaround semantics, but I'm not sure how to translate that into Prodigy's pattern semantics. Any advice welcome. Thanks

Hi! This currently isn't possible using spaCy's matcher out-of-the-box. But you could either write some logic around it that removes the first token of the match for cases like this, or use regular expressions on the "text" directly (like this).
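As a rough sketch of the regular-expression route, a lookbehind on the raw text can require "at " before the digits without including it in the match. The pattern and the `find_times` helper below are only illustrative assumptions for the example sentence from the question, not anything built into spaCy or Prodigy:

```python
import re

# Lookbehind: require "at " immediately before the digits, but keep it out of the match.
# Python's re module only allows fixed-width lookbehinds, which "at " satisfies.
TIME_AFTER_AT = re.compile(r"(?<=\bat )\d+", re.IGNORECASE)


def find_times(text):
    """Return (start, end) character offsets of digit runs that directly follow 'at '."""
    return [match.span() for match in TIME_AFTER_AT.finditer(text)]


text = "I will meet you at 5"
for start, end in find_times(text):
    # Slicing the text with the offsets gives back only the number, not "at".
    print(start, end, text[start:end])
```

Because the regex works on characters rather than tokens, the offsets can be used directly as span boundaries.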

To use your custom matcher in Prodigy, all you need is a function that takes the example, matches on the "text" and adds a "spans" property to the example where each span is a dict with a "start" and "end" (character offsets) and a "label".
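Such a function could look roughly like this. It is a minimal sketch: `PATTERNS`, `add_spans` and `match_stream` are hypothetical names, the regex is just the "digits after at" case from the question, and how the resulting stream gets plugged into a recipe is left out because it depends on the recipe you use.

```python
import re

# Hypothetical (label, regex) pairs; the lookbehind keeps "at" out of the span.
PATTERNS = [("time", re.compile(r"(?<=\bat )\d+", re.IGNORECASE))]


def add_spans(example, patterns=PATTERNS):
    """Match on example["text"] and add a "spans" list of start/end/label dicts."""
    text = example["text"]
    spans = list(example.get("spans", []))  # keep any spans already present
    for label, regex in patterns:
        for match in regex.finditer(text):
            # Character offsets into the text, plus the label to pre-highlight.
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    example["spans"] = spans
    return example


def match_stream(stream):
    """Yield each incoming example with the regex-based spans attached."""
    for example in stream:
        yield add_spans(example)
```

For example, `add_spans({"text": "I will meet you at 5"})` would attach a single "time" span covering only the "5".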

this works nicely...thanks
