Creating patterns library from scratch

toxtoth · August 17, 2021, 7:00pm

Hi all, I want to create a patterns library manually from scratch, but I don't think there's an existing recipe for this exact use case, unless I've missed it?

I know I can use pre-existing patterns before starting an ner.manual annotation session to pre-select text, but I want to:

load sample text and start prodigy session
select tokens and label them
save those tokens to the patterns library as I am progressing through the session
have any previously saved patterns pre-selected in the same annotation session

I.e. the annotator is shown pre-selected spans for patterns they have already created in that session, so they know the pattern exists and don't need to annotate that text again.

Any thoughts?

ines · August 18, 2021, 12:39am

Hi! There's no built-in workflow for this, but you should be able to implement something like it by adapting a recipe like ner.manual and adding an update callback that updates your matcher from spans annotated in the data.
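As a rough sketch of what such an update callback could look like: the function below assumes the usual task format (a "text", an "answer" and a list of "spans" with character offsets) and would be returned under the "update" key of your custom recipe. The names make_update and known_patterns are just placeholders for illustration, not part of Prodigy.

```python
def make_update(known_patterns):
    """Return an update callback that records annotated spans as patterns.

    known_patterns is a set of (label, lowercased text) pairs. It is a
    hypothetical shared store that the stream logic can read from later.
    """
    def update(answers):
        for eg in answers:
            if eg.get("answer") != "accept":
                continue  # skip rejected and ignored examples
            for span in eg.get("spans", []):
                text = eg["text"][span["start"]:span["end"]]
                known_patterns.add((span["label"], text.lower()))
    return update


# Simulate one batch of answers coming back from the app
known = set()
update = make_update(known)
update([
    {"text": "I like Prodigy", "answer": "accept",
     "spans": [{"start": 7, "end": 14, "label": "TOOL"}]},
    {"text": "Skip spaCy", "answer": "ignore",
     "spans": [{"start": 5, "end": 10, "label": "TOOL"}]},
])
```

Keep in mind that answers come back in batches, so a new pattern can only show up on examples that are sent out after the batch containing it was received.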

See this thread for a pretty similar approach (and some considerations for how to handle the batching):

Instead of going via the PatternMatcher, you might want to use the Matcher or PhraseMatcher directly, which removes one layer of abstraction. When setting the "spans" on the incoming examples, just make sure you're filtering overlaps (e.g. with spaCy's filter_spans utility) so you don't end up with overlapping matches. Alternatively, you could also use the new spans_manual UI, which lets you create and show overlapping spans. This could be pretty useful, actually, because it'll let you view potential conflicts and ambiguous patterns.
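To illustrate the overlap filtering on plain span dicts, here's a minimal sketch that doesn't use spaCy at all: it finds known phrases with a regular expression and keeps the longest span when matches overlap, which is the same preference filter_spans applies to Span objects. The helpers filter_overlaps and add_spans are made up for this example, and for ner.manual the character offsets would still need to line up with token boundaries.

```python
import re


def filter_overlaps(spans):
    """Prefer longer spans (earlier start on ties) and drop overlapping ones."""
    by_length = sorted(
        spans, key=lambda s: (s["end"] - s["start"], -s["start"]), reverse=True
    )
    kept, taken = [], set()
    for span in by_length:
        chars = range(span["start"], span["end"])
        if not any(i in taken for i in chars):
            kept.append(span)
            taken.update(chars)
    return sorted(kept, key=lambda s: s["start"])


def add_spans(eg, known_patterns):
    """Pre-select every known (label, phrase) pair found in the example text."""
    spans = []
    for label, phrase in known_patterns:
        regex = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(regex, eg["text"], re.IGNORECASE):
            spans.append({"start": m.start(), "end": m.end(), "label": label})
    eg["spans"] = filter_overlaps(spans)
    return eg


eg = add_spans(
    {"text": "New York City is in New York"},
    {("GPE", "new york"), ("GPE", "new york city")},
)
```

Here "New York City" wins over the shorter "New York" at the start of the text, while the second "New York" is kept because nothing overlaps it.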

In your case, you'd just have to decide whether you want your patterns to be exact string matches (text[span["start"]:span["end"]]), or if you want to tokenize them so you can have a list of tokens like [{"lower": "foo"}, {"lower": "bar"}]. That kinda depends on your use case.
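The two options could look roughly like this. The sketch assumes the annotated example carries "tokens" and that each span has "token_start" and "token_end" (inclusive) in addition to the character offsets, as in the manual NER interface. The function names are hypothetical.

```python
def string_pattern(eg, span):
    """Exact string match: the raw text covered by the span."""
    return {"label": span["label"], "pattern": eg["text"][span["start"]:span["end"]]}


def token_pattern(eg, span):
    """Token-based match: one {"lower": ...} dict per token in the span."""
    # token_end is treated as inclusive here, hence the + 1
    tokens = eg["tokens"][span["token_start"]:span["token_end"] + 1]
    return {
        "label": span["label"],
        "pattern": [{"lower": t["text"].lower()} for t in tokens],
    }


eg = {
    "text": "Foo Bar rocks",
    "tokens": [
        {"text": "Foo", "start": 0, "end": 3, "id": 0},
        {"text": "Bar", "start": 4, "end": 7, "id": 1},
        {"text": "rocks", "start": 8, "end": 13, "id": 2},
    ],
    "spans": [
        {"start": 0, "end": 7, "token_start": 0, "token_end": 1, "label": "BAND"}
    ],
}
exact = string_pattern(eg, eg["spans"][0])
tokenized = token_pattern(eg, eg["spans"][0])
```

The string version only matches "Foo Bar" exactly as written, while the token version also matches "foo bar" or "FOO BAR".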

In your recipe, you could also add an on_exit callback that exports all patterns in the matcher to a file once you exit the annotation session, and some logic on load that sets up the matcher from the current dataset (for when you want to resume a session). You could also do the export as a separate command that takes the name of the dataset and exports a patterns file based on the annotations.
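A minimal sketch of the export side, assuming exact string patterns and examples that have already been fetched from the dataset as a list of dicts. The helper names and the patterns.jsonl filename are placeholders. The same patterns_from_examples function could be called on load to rebuild the matcher when resuming, and export_patterns could be called from on_exit or from a separate command.

```python
import json
import tempfile
from pathlib import Path


def patterns_from_examples(examples):
    """Build de-duplicated string patterns from accepted annotations."""
    seen, patterns = set(), []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            key = (span["label"], eg["text"][span["start"]:span["end"]])
            if key not in seen:
                seen.add(key)
                patterns.append({"label": key[0], "pattern": key[1]})
    return patterns


def export_patterns(patterns, path):
    """Write one JSON object per line (the patterns JSONL layout)."""
    lines = "".join(json.dumps(p) + "\n" for p in patterns)
    Path(path).write_text(lines, encoding="utf8")


examples = [
    {"text": "I use spaCy daily", "answer": "accept",
     "spans": [{"start": 6, "end": 11, "label": "TOOL"}]},
    {"text": "spaCy again", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "TOOL"}]},
]
patterns = patterns_from_examples(examples)

# Round trip through a temporary file to show the on-disk format
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "patterns.jsonl"
    export_patterns(patterns, out)
    reloaded = [json.loads(line) for line in out.read_text(encoding="utf8").splitlines()]
```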

toxtoth · August 18, 2021, 12:39pm

Thank you, Ines, for your very fast and detailed response! I will give it a go. Thanks for your help.
