Lookaround with Prodigy / spaCy matcher semantics

Superscope · October 6, 2019, 12:41pm

I want to identify times in a dialogue. So the natural text might be something like:

"I will meet you at 5"

I want to add a heuristic to a pattern.jsonl file that annotates the tokens that come after the word 'at', but not the word 'at' itself. So the following won't work:

{"label": "time", "pattern": [{"LOWER": {"IN": ["at"]}}, {"LOWER": {"REGEX": "\\d"}}]}

since it will match

"at 5"

rather than just

"5"

With normal regular expressions I could use lookaround semantics, but I'm not sure how to translate that into Prodigy's pattern semantics. Any advice welcome. Thanks

ines · October 7, 2019, 12:56pm

Hi! This currently isn't possible using spaCy's matcher out-of-the-box. But you could either write some logic around it that removes the first token of the match for cases like this, or use regular expressions on the "text" directly (like this).
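As a rough illustration of the first option, here is a minimal sketch that trims the leading token from each match. It assumes the matches arrive as (match_id, start, end) token-offset tuples, which is the shape spaCy's Matcher returns; the drop_first_token helper and the string match ID are hypothetical and only here for readability (spaCy uses integer hash IDs).

```python
def drop_first_token(matches):
    """Move the start of each (match_id, start, end) token match forward by one."""
    trimmed = []
    for match_id, start, end in matches:
        # Skip matches that would be empty once the leading token is removed.
        if end - start > 1:
            trimmed.append((match_id, start + 1, end))
    return trimmed


# Hand-written stand-in for matcher output on "I will meet you at 5":
# tokens 4..6 cover "at 5", so the trimmed match should cover only "5".
tokens = ["I", "will", "meet", "you", "at", "5"]
matches = [("time", 4, 6)]
trimmed = drop_first_token(matches)
```

The trimmed token offsets can then be used to build the span in place of the full "at 5" match.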

To use your custom matcher in Prodigy, all you need is a function that takes the example, matches on the "text" and adds a "spans" property to the example where each span is a dict with a "start" and "end" (character offsets) and a "label".
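A minimal sketch of such a function, using a regular expression with a lookbehind so that only the digits after "at" are captured. The function name add_time_spans, the TIME_AFTER_AT pattern and the "time" label are assumptions for illustration, not part of Prodigy's API; the only thing taken from the description above is the shape of the "spans" entries.

```python
import re

# Lookbehind: match digits only when they directly follow the word "at" and a space.
# The pattern itself is an assumption; adjust it to the time formats in your data.
TIME_AFTER_AT = re.compile(r"(?<=\bat )\d+", re.IGNORECASE)


def add_time_spans(example, label="time"):
    """Add a character-offset span for every number that follows 'at'."""
    text = example["text"]
    # Keep any spans that are already on the example.
    spans = list(example.get("spans", []))
    for match in TIME_AFTER_AT.finditer(text):
        spans.append({"start": match.start(), "end": match.end(), "label": label})
    example["spans"] = spans
    return example


example = add_time_spans({"text": "I will meet you at 5"})
```

Because the lookbehind is zero-width, the resulting span covers "5" but not "at". The function could be applied to each incoming example in a custom recipe's stream before the examples are sent out for annotation.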

Superscope · October 7, 2019, 2:06pm

this works nicely...thanks
