Prodigy Custom Model; Model in the Loop (matcher)

Hi! I'm looking for an opinion on the best way to implement the following NER annotation workflow (i.e. related to "model in the loop").


We have N annotated notes that we'd like to pass through Prodigy to remove/add further annotations. The preset annotations are just regexes and we passed them to Prodigy via the spans field in a JSONL file. Now, when we are reannotating these notes, we'd like the annotations we do to be applied to future notes in a batch. For example, say as we're reannotating these notes, we notice that "car" isn't being annotated. Since we would have annotated "car" in the first note, we'd like all future instances of "car" to be highlighted as an annotation so we don't have to redo it.

Our current workflow is based on the "model in the loop" idea: we'd like to update our model after a batch of notes and rerun all unprocessed notes through this model. This idea normally applies to BERT models, but we'd like to apply it to a PatternMatcher instance.

Below is pseudocode for an implementation idea.

    from typing import List, Optional

    import prodigy
    import spacy
    from prodigy.components.loaders import JSONL
    from prodigy.components.preprocess import add_tokens
    from prodigy.models.matcher import PatternMatcher
    from prodigy.util import split_string

    @prodigy.recipe(
        "ner-regex",  # placeholder recipe name
        dataset=("The dataset to use", "positional", None, str),
        spacy_model=("The base model", "positional", None, str),
        source=("The source data as a JSONL file", "positional", None, str),
        label=("One or more comma-separated labels", "option", "l", split_string),
        patterns=("Optional match patterns", "option", "p", str),
    )
    def ner_regex(
        dataset: str,
        spacy_model: str,
        source: str,
        label: Optional[List[str]] = None,
        patterns: Optional[str] = None,
    ):
        nlp = spacy.blank("en")
        # The matcher plays the role of the "model" in the loop
        matcher = PatternMatcher(nlp).from_disk(patterns)

        update = matcher.update

        stream = JSONL(source)
        stream = add_tokens(nlp, stream)

        # Calling the matcher on the stream yields (score, example) tuples
        predict = matcher
        stream = (eg for score, eg in predict(stream))

        return {
            "view_id": "ner_manual",  # annotation interface to use
            "dataset": dataset,       # dataset to save annotations to
            "stream": stream,         # the incoming stream of examples
            "update": update,         # the update callback
            "config": {
                "lang": "en",
                "labels": label,
            },
        }

The idea is that the unannotated examples would be updated once the user saves the annotations (which calls the update function). However, this code doesn't work, and we believe we may have to implement a custom model. Additionally, we don't fully understand how the batch_size parameter works (in this case, we'd like it to be 1, so that the unannotated examples get updated after each annotation).

Any feedback and help will be greatly appreciated, thank you!

Hi! This is definitely a cool idea :+1: And yes, for that to work you should implement your own function that adds annotated spans to the matcher as patterns. The PatternMatcher currently only sets the pattern-based annotations and is updated with the accept/reject information to update the scores assigned to the patterns (so you can use it in an active learning context where you filter based on certain/uncertain predictions). It doesn't add annotated spans to the patterns, since this isn't always what you want in an annotation workflow.

So you could do something like this in your update callback: get the text of all annotated spans and add them to the matcher via PatternMatcher.add_patterns:

    def update(answers):
        patterns = []
        for eg in answers:
            for span in eg.get("spans", []):
                # Get the text of each annotated span given its offsets
                span_text = eg["text"][span["start"]:span["end"]]
                # Dicts aren't hashable, so collect the patterns in a list
                patterns.append({"pattern": span_text, "label": span["label"]})
        # `matcher` is the PatternMatcher instance created in the recipe
        matcher.add_patterns(patterns)

The above code is a very simple example so you can see how it works. You probably want to make this more elegant and keep a record of all label/pattern combinations you already have, so you only add a pattern if you don't already know it. Another idea would be to implement some more fine-grained heuristics for deciding whether to add a pattern: maybe you only want to include it once you have X annotations with a given span, instead of only one.
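To make the bookkeeping idea a bit more concrete, here is a minimal sketch in plain Python. PatternTracker, min_count and new_patterns are hypothetical names made up for this illustration and are not part of Prodigy. It assumes each answer carries the usual "text", "spans" and "answer" keys, and it only counts spans from accepted examples.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


class PatternTracker:
    """Hypothetical helper: decides which annotated spans become new patterns."""

    def __init__(self, min_count: int = 1) -> None:
        self.min_count = min_count                 # annotations needed before adding
        self.counts: Counter = Counter()           # (label, text) -> times annotated
        self.known: Set[Tuple[str, str]] = set()   # combinations already added

    def new_patterns(self, answers: List[dict]) -> List[Dict[str, str]]:
        fresh = []
        for eg in answers:
            # Assumption: skip examples that were rejected or ignored
            if eg.get("answer", "accept") != "accept":
                continue
            for span in eg.get("spans", []):
                text = eg["text"][span["start"]:span["end"]]
                key = (span["label"], text)
                self.counts[key] += 1
                # Only emit a pattern once, and only after enough annotations
                if key not in self.known and self.counts[key] >= self.min_count:
                    self.known.add(key)
                    fresh.append({"label": span["label"], "pattern": text})
        return fresh
```

In the update callback you could then pass the result on, for example matcher.add_patterns(tracker.new_patterns(answers)), so that a label/pattern combination is only added the first time it crosses the threshold.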

You could also consider implementing this directly via spaCy's Matcher or PhraseMatcher, which removes one layer of abstraction. Alternatively, if you're already using your own logic with regular expressions, you could also update your regex based on the annotated text spans, and then use that in the function that processes the stream.
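A rough sketch of the regex-based variant, using only the standard library: RegexPrelabeler is a hypothetical class written for this illustration, not a Prodigy or spaCy API. Its update method collects the annotated span texts per label, and calling it on a stream adds a span for every whole-word match that doesn't overlap a span the example already has.

```python
import re
from typing import Dict, Iterable, Iterator, List, Set


class RegexPrelabeler:
    """Hypothetical helper: pre-labels examples with previously annotated span texts."""

    def __init__(self) -> None:
        self.terms: Dict[str, Set[str]] = {}  # label -> annotated span texts

    def update(self, answers: List[dict]) -> None:
        # Usable as the update callback: remember the text of each annotated span
        for eg in answers:
            for span in eg.get("spans", []):
                text = eg["text"][span["start"]:span["end"]]
                if text:
                    self.terms.setdefault(span["label"], set()).add(text)

    def _compile(self, label: str) -> "re.Pattern":
        # Longest alternatives first, so "car park" is tried before "car"
        alternatives = sorted(self.terms[label], key=len, reverse=True)
        body = "|".join(re.escape(term) for term in alternatives)
        # Lookarounds instead of \b so terms ending in punctuation still match
        return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")

    def __call__(self, stream: Iterable[dict]) -> Iterator[dict]:
        for eg in stream:
            spans = list(eg.get("spans", []))
            # The terms are read when the example is pulled, so later updates apply
            for label in list(self.terms):
                for match in self._compile(label).finditer(eg["text"]):
                    overlaps = any(
                        match.start() < s["end"] and s["start"] < match.end()
                        for s in spans
                    )
                    if not overlaps:
                        spans.append(
                            {"start": match.start(), "end": match.end(), "label": label}
                        )
            yield {**eg, "spans": sorted(spans, key=lambda s: s["start"])}
```

In a recipe, the same instance would be used twice: wrap the stream with it and return its update method as the "update" callback. Because the stream is a generator, only examples that haven't been pulled yet pick up newly learned terms.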

One thing to keep in mind is that Prodigy will typically queue up examples in the background as you annotate, so you don't have to wait between annotations. So even with a batch size of 1, example 2 will already be requested from the stream while you annotate example 1, and so on. Prodigy will also keep one batch in the app so you can easily go back and undo, without ending up with multiple conflicting versions on the back-end if you made a mistake during annotation. So there'll always be a small delay of 2 * batch_size examples before the examples you've annotated hit your update callback.

IMO, this is an okay trade-off for the solution you want to implement, because you'll likely have an uneven distribution of entities anyway and the same span won't be present in every example. So it may happen that you annotate "car" in example 1, see example 2 with "car" that wasn't pre-labeled, keep annotating and then see "car" pre-labeled in example 4 and any future examples.
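If it helps to see why the delay shows up, here is a toy simulation in plain Python. It is not Prodigy's actual queueing code: it simply assumes that by the time the answer for an example reaches the update callback, the next 2 * batch_size examples have already been pulled from the (lazy) stream. The function name simulate and its arguments are made up for this sketch.

```python
from collections import deque
from itertools import islice
from typing import Iterator, List, Set, Tuple


def simulate(texts: List[str], term: str, batch_size: int = 1) -> List[Tuple[int, bool]]:
    """Return (example number, was pre-labeled) for each example the annotator sees."""
    known: Set[str] = set()  # terms the "matcher" has learned so far

    def stream() -> Iterator[Tuple[int, str, bool]]:
        for number, text in enumerate(texts, start=1):
            # Evaluated lazily: `known` is read when the example is pulled
            yield number, text, (term in known and term in text)

    lookahead = 2 * batch_size  # assumption taken from the explanation above
    source = stream()
    # The example on screen plus the examples already pulled in the background
    pulled = deque(islice(source, lookahead + 1))
    seen: List[Tuple[int, bool]] = []
    while pulled:
        number, text, prelabeled = pulled.popleft()
        seen.append((number, prelabeled))
        if term in text:
            known.add(term)  # the annotator labels it and update() learns it
        pulled.extend(islice(source, 1))  # the next example is pulled only now
    return seen


print(simulate(["car"] * 5, "car"))
# [(1, False), (2, False), (3, False), (4, True), (5, True)]
```

Under that assumption, a span first annotated in example 1 shows up pre-labeled from example 4 onwards with a batch size of 1, which matches the scenario described above.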

(Btw, if you are using the PatternMatcher, you can also set the batch_size when you call it. This is usually less relevant because you often want to buffer pattern matches in the background, but in your case, you want to set this lower to match your batch size. Again, it might be easier to just call spaCy's Matcher or PhraseMatcher, or use a regex-based approach here to remove some complexity.)

Thank you for your response! We ended up implementing the Matcher class with a batch_size of 1 and a few minor changes, and it works! :smiley:
