Prodigy Custom Model; Model in the Loop (matcher)

Hi! This is definitely a cool idea :+1: And yes, for that to work you should implement your own function that adds annotated spans to the matcher as patterns. The PatternMatcher currently only sets the pattern-based annotations and is updated with the accept/reject information to update the scores assigned to the patterns (so you can use it in an active learning context where you filter based on certain/uncertain predictions). It doesn't add annotated spans to the patterns, since this isn't always what you want in an annotation workflow.

So you could do something like this in your update callback: get the text of all annotated spans and add them to the matcher via PatternMatcher.add_patterns:

def update(answers):
    patterns = []
    for eg in answers:
        for span in eg.get("spans", []):
            # Get the text of each annotated span given its offsets
            span_text = eg["text"][span["start"]:span["end"]]
            patterns.append({"pattern": span_text, "label": span["label"]})
    matcher.add_patterns(patterns)

The above code is a very simple example so you can see how it works. You'll probably want to make this more robust and keep a record of all (label, pattern) combinations you already have, so a pattern is only added if you don't already know it. Another idea would be to implement more fine-grained heuristics for deciding whether to add a pattern: for example, only include a span once you have X annotations containing it, instead of just one.
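The bookkeeping described above could be sketched like this. The `PatternStore` class and `MIN_COUNT` threshold are hypothetical names for illustration, not part of Prodigy's API:

```python
from collections import Counter

# Hypothetical threshold: only promote a span to a pattern once it has
# been annotated this many times.
MIN_COUNT = 3

class PatternStore:
    def __init__(self):
        self.counts = Counter()  # (label, pattern) -> times annotated
        self.added = set()       # combinations already sent to the matcher

    def update(self, answers):
        """Return the new patterns to add for a batch of answers."""
        new_patterns = []
        for eg in answers:
            for span in eg.get("spans", []):
                span_text = eg["text"][span["start"]:span["end"]]
                key = (span["label"], span_text)
                self.counts[key] += 1
                # Only add a pattern once it crosses the threshold, and
                # never add the same (label, pattern) combination twice
                if self.counts[key] >= MIN_COUNT and key not in self.added:
                    self.added.add(key)
                    new_patterns.append(
                        {"label": span["label"], "pattern": span_text}
                    )
        return new_patterns
```

In your update callback you'd then do something like `matcher.add_patterns(store.update(answers))`, so the matcher only ever sees each pattern once, and only after enough evidence has accumulated.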

You could also consider implementing this directly via spaCy's Matcher or PhraseMatcher, which removes one layer of abstraction. Alternatively, if you're already using your own logic with regular expressions, you could update your regex based on the annotated text spans and then use it in the function that processes the stream.
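The regex-based variant could look something like this, assuming a plain dict of annotated span texts per label (all names here are illustrative):

```python
import re

# Annotated span texts collected so far, keyed by label (example data)
span_texts = {"VEHICLE": {"car", "truck"}}

def build_regex(texts):
    # Sort longest-first so a longer alternative like "cargo ship"
    # would win over a shorter prefix like "car"
    alternatives = sorted(texts, key=len, reverse=True)
    return re.compile(r"\b(?:%s)\b" % "|".join(re.escape(t) for t in alternatives))

def find_spans(text, label, pattern):
    # Produce span dicts in the format Prodigy expects in eg["spans"]
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in pattern.finditer(text)
    ]

pattern = build_regex(span_texts["VEHICLE"])
spans = find_spans("A car and a truck", "VEHICLE", pattern)
```

Whenever your update callback sees new spans, you'd add their texts to `span_texts` and rebuild the compiled pattern, and the function that processes the stream would call `find_spans` to pre-highlight incoming examples.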

One thing to keep in mind is that Prodigy will typically queue up examples in the background as you annotate, so you don't have to wait between annotations. So even with a batch size of 1, example 2 will already be requested from the stream while you annotate example 1, and so on. Prodigy will also keep one batch in the app so you can easily go back and undo a mistake, without ending up with multiple conflicting versions on the back-end. So there'll always be a small delay of 2 * batch_size before the examples you've annotated hit your update callback.

IMO, this is an okay trade-off for the solution you want to implement, because you'll likely have an uneven distribution of entities anyway and the same span won't be present in every example. So it may happen that you annotate "car" in example 1, see example 2 with "car" that wasn't pre-labeled, keep annotating and then see "car" pre-labeled in example 4 and any future examples.

(Btw, if you are using the PatternMatcher, you can also set the batch_size when you call it. This is usually less relevant because you often want to buffer pattern matches in the background, but in your case, you want to set this lower to match your batch size. Again, it might be easier to just call spaCy's Matcher or PhraseMatcher, or use a regex-based approach here to remove some complexity.)
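To make the spaCy route concrete, here's a minimal sketch using spaCy's PhraseMatcher directly, with a blank English pipeline and the label encoded as the match key (the label and example text are made up):

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("VEHICLE", [nlp.make_doc("car")])

doc = nlp.make_doc("My Car broke down")
# Convert token-based matches to character offsets, as used in eg["spans"]
spans = [
    {
        "start": doc[start].idx,
        "end": doc[end - 1].idx + len(doc[end - 1]),
        "label": nlp.vocab.strings[match_id],
    }
    for match_id, start, end in matcher(doc)
]
```

Adding a newly annotated span is then just another `matcher.add(label, [nlp.make_doc(span_text)])` call in your update callback, with no pattern-scoring machinery in between.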