Pre-annotate entities with patterns


I want to create a JSONL annotations file using patterns with the es_core_news_lg pipeline. In this case, I'm trying to annotate MONEY entities using a JSONL patterns file I have created.

Is there a way to do that, perhaps with a custom recipe? I tried using get_stream for the data JSONL file and PatternMatcher.from_disk for the patterns JSONL file (the code is below). The problem is that if a text contains more than one entity, each match is written as a separate JSON object in the pre-annotation JSONL file instead of one object per text. How can I solve this?

import spacy
import prodigy
from prodigy.components import printers
from prodigy.components.loaders import get_stream
from prodigy.core import recipe, recipe_args
from prodigy.models.matcher import PatternMatcher
from prodigy.util import log
import json

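@recipe('ner.preannotate-patterns',
        spacy_model=recipe_args['spacy_model'],
        patterns=('Path to match patterns file', 'positional'),
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'])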
def print_pattern_stream(spacy_model, patterns, source=None, api=None, loader=None):
    log("RECIPE: Starting recipe ner.preannotate-patterns", locals())
    model = PatternMatcher(spacy.load(spacy_model)).from_disk(patterns)
    stream = get_stream(source, api, loader, rehash=True, input_key='text')
    with open('preannotations_money.jsonl', 'w') as file:
        for line in model(stream):
            # each item is a (score, example) tuple, one per match
            json_line = json.dumps(line[1])
            file.write(json_line + '\n')

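Hi! From what you describe, it sounds like you could solve this by using ner.manual with the --patterns argument, which pre-highlights all pattern matches in each example so you only need to correct them.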

Prodigy supports both token-based and exact string patterns (like the ones used by spaCy's matchers), so your file could look like this:

{"label": "MONEY", "pattern": "€123"}
{"label": "MONEY", "pattern": [{"IS_CURRENCY": true}, {"LIKE_NUM": true}]}

Just a quick note on how you would solve this if you wanted to implement it in a custom recipe, although you shouldn't have to because ner.manual has you covered 🙂

Under the hood, the pre-annotation works like this:

  • Stream in all your input examples and load the patterns into a matcher.
  • For each example in the stream, process it with spaCy and match your patterns on it.
  • If relevant: Make sure that you filter the matches so you don't end up with overlapping spans. You can do this using spaCy's filter_spans utility.
  • Add a list of "spans" to each example containing the matches. spaCy's matcher will let you create Span objects for each match, so you can do {"start": span.start_char, "end": span.end_char, "label": "MONEY"} for each span.
  • Add tokens to your stream using Prodigy's add_tokens helper.
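
The steps above can be sketched roughly as follows, using only spaCy (v3 assumed, for Matcher.add with a list of patterns and as_spans=True). This is a minimal illustration rather than what ner.manual does internally. The example text, the helper names build_matcher and preannotate, and the use of a blank "es" pipeline instead of es_core_news_lg are all assumptions made so the snippet runs without a model download. Only token-based patterns are handled here; exact string patterns would need a PhraseMatcher.

```python
import json

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans


def build_matcher(nlp, patterns):
    """Load token-based patterns ({"label": ..., "pattern": [...]}) into a Matcher."""
    matcher = Matcher(nlp.vocab)
    for entry in patterns:
        matcher.add(entry["label"], [entry["pattern"]])
    return matcher


def preannotate(nlp, matcher, texts):
    """Yield one task per text, with all non-overlapping matches in "spans"."""
    for doc in nlp.pipe(texts):
        # as_spans=True returns Span objects labelled with the pattern's key
        matches = matcher(doc, as_spans=True)
        # Drop overlapping matches, preferring the longest span
        spans = filter_spans(matches)
        yield {
            "text": doc.text,
            "spans": [
                {"start": s.start_char, "end": s.end_char, "label": s.label_}
                for s in spans
            ],
        }


# Assumption: a blank Spanish pipeline is enough, because these patterns
# only rely on lexical attributes (IS_CURRENCY, LIKE_NUM).
nlp = spacy.blank("es")
patterns = [
    {"label": "MONEY", "pattern": [{"IS_CURRENCY": True}, {"LIKE_NUM": True}]}
]
matcher = build_matcher(nlp, patterns)

# Hypothetical input with two MONEY mentions in a single text
texts = ["Pagué €20 por el libro y €5 por el café."]
tasks = list(preannotate(nlp, matcher, texts))

# One JSON object per text, regardless of how many entities it contains
with open("preannotations_money.jsonl", "w", encoding="utf8") as f:
    for task in tasks:
        f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

Because the spans are collected per Doc, a text with several matches still produces a single line in the output file. Inside a Prodigy recipe, the resulting stream could then be passed through add_tokens (from prodigy.components.preprocess) before being shown in the manual interface.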