Creating patterns library from scratch

Hi all, I'd like to create a patterns library manually from scratch, but I don't think there is an existing recipe for this exact use case, unless I've missed it?

I know I can use pre-existing patterns when starting an ner.manual annotation session to pre-select text, but I want to:

  • load sample text and start a Prodigy session
  • select tokens and label them
  • save those tokens to the patterns library as I am progressing through the session
  • have any previously saved patterns pre-selected in the same annotation session

i.e. the annotator is shown pre-selected patterns that they have already annotated in that session, so they know they have already created a pattern and don't need to annotate that text again

Any thoughts?

Hi! There's no built-in workflow for this, but you should be able to implement something like it by adapting a recipe like ner.manual and adding an update callback that updates your matcher from spans annotated in the data.

See this thread for a pretty similar approach (and some considerations for how to handle the batching):

Instead of going via the PatternMatcher, you might want to use the Matcher or PhraseMatcher directly, which removes one layer of abstraction. When setting the "spans" on the incoming examples, just make sure you're filtering overlaps (e.g. with spaCy's filter_spans utility) so you don't end up with overlapping matches. Alternatively, you could also use the new spans_manual UI, which lets you create and show overlapping spans. This could be pretty useful, actually, because it'll let you view potential conflicts and ambiguous patterns.
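To make the idea concrete, here is a minimal sketch of the two pieces, assuming spaCy v3 and a blank English pipeline. The function names update and add_spans are just illustrative, and the span dict keys follow the usual Prodigy task format (with an inclusive "token_end"). How these get wired into your adapted recipe, and how already-queued batches are handled, is not shown.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match on lowercase token text


def update(answers):
    """Add the text of every span in accepted answers to the matcher."""
    for eg in answers:
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            phrase = eg["text"][span["start"]:span["end"]]
            matcher.add(span["label"], [nlp.make_doc(phrase)])


def add_spans(stream):
    """Pre-select known phrases on incoming tasks, dropping overlapping matches."""
    for eg in stream:
        doc = nlp.make_doc(eg["text"])
        # filter_spans prefers the longest span when matches overlap
        matches = filter_spans(matcher(doc, as_spans=True))
        eg["spans"] = [
            {
                "start": s.start_char,
                "end": s.end_char,
                "token_start": s.start,
                "token_end": s.end - 1,  # assumed inclusive, as in Prodigy tasks
                "label": s.label_,
            }
            for s in matches
        ]
        yield eg
```

Because add_spans is a generator, it only sees patterns that were added before a given task is pulled from the stream, which is where the batching considerations come in.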

In your case, you'd just have to decide whether you want your patterns to be exact string matches (text[span["start"]:span["end"]]), or if you want to tokenize them so you can have a list of tokens like [{"lower": "foo"}, {"lower": "bar"}]. That kinda depends on your use case.
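As a rough illustration of the two options, the sketch below builds both kinds of pattern entry from one annotated span. It assumes the task has "tokens" and that spans carry "token_start" and an inclusive "token_end", as in Prodigy's manual NER tasks; the helper names exact_pattern and token_pattern are made up for this example.

```python
def exact_pattern(eg, span):
    """Exact string pattern: the annotated text as-is (case-sensitive)."""
    return {"label": span["label"], "pattern": eg["text"][span["start"]:span["end"]]}


def token_pattern(eg, span):
    """Token pattern: one {"lower": ...} dict per annotated token."""
    tokens = eg["tokens"][span["token_start"]:span["token_end"] + 1]
    return {
        "label": span["label"],
        "pattern": [{"lower": t["text"].lower()} for t in tokens],
    }


# Example task in the assumed format
eg = {
    "text": "Visit New York",
    "tokens": [
        {"text": "Visit", "start": 0, "end": 5, "id": 0},
        {"text": "New", "start": 6, "end": 9, "id": 1},
        {"text": "York", "start": 10, "end": 14, "id": 2},
    ],
    "spans": [{"start": 6, "end": 14, "token_start": 1, "token_end": 2, "label": "GPE"}],
}
span = eg["spans"][0]
print(exact_pattern(eg, span))  # {'label': 'GPE', 'pattern': 'New York'}
print(token_pattern(eg, span))  # {'label': 'GPE', 'pattern': [{'lower': 'new'}, {'lower': 'york'}]}
```

The token version also matches "new york" or "NEW YORK", while the exact string only matches the text as it was annotated.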

In your recipe, you could also add an on_exit callback that exports all patterns in the matcher to a file once you exit the annotation session, and some logic on load that sets up the matcher from the current dataset (for when you want to resume a session). You could also do the export as a separate command that takes the name of the dataset and exports a patterns file based on the annotations.
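The export step could look roughly like the sketch below. It takes a list of annotated examples (how you load them from the dataset is left out here) and writes one exact-string pattern per line as JSONL. The function name export_patterns is hypothetical, and the same function could be called from an on_exit callback or from a standalone script.

```python
import json


def export_patterns(examples, path):
    """Write one {"label", "pattern"} entry per unique accepted span to a JSONL file."""
    seen = set()
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            if eg.get("answer") != "accept":
                continue  # skip rejected and ignored examples
            for span in eg.get("spans", []):
                phrase = eg["text"][span["start"]:span["end"]]
                key = (span["label"], phrase)
                if key in seen:
                    continue  # don't write the same pattern twice
                seen.add(key)
                entry = {"label": span["label"], "pattern": phrase}
                f.write(json.dumps(entry) + "\n")
    return len(seen)  # number of patterns written
```

Reading the same file back line by line and adding each entry to the matcher on startup would cover the resume case.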

Thank you, Ines, for your very fast and detailed response! I'll give it a go. Thanks for your help.