Prodigy using the model instead of the patterns during ner.teach

I was using ner.teach to learn to recognize date entities in particular contexts, with date-matching patterns as my seed. What I’ve seen in the past is that Prodigy starts by proposing candidates that exactly match the dates, and only after you have gone through a lot of those does it start making suggestions from the model, which is what Prodigy is supposed to do.

I tried this on a new corpus, and the candidates proposed by Prodigy start off coming from the model. Since the model is untrained, they’re essentially random. It looks like the patterns are never used.

I don’t understand how this could be happening. I’m running everything exactly the same way as before; the only difference is the corpus. The one odd thing is that this new corpus is tiny, on the order of 100 candidate entities.

Does this sound like a bug, or is there some corner case for small corpora that would compel Prodigy to use the model instead of the patterns?

The most likely explanation is that no matches, or not enough matches, are found in the corpus or in the respective batches. If ner.teach is used with patterns, the model and the pattern matcher are combined, and their results (pattern matches and model predictions) are merged using the toolz.interleave function.

In an ideal case, that would look like this (with each number representing a result):

from toolz import interleave

from_patterns = [1, 2, 3, 4, 5]  # results from the pattern matcher
from_model = [6, 7]              # results from the model
list(interleave((from_patterns, from_model)))
# [1, 6, 2, 7, 3, 4, 5]

However, if the patterns don’t produce any matches in that batch, the combined model will only output the model’s predictions.
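For example, if the pattern matcher comes up empty for a batch, the merged stream falls through to the model’s suggestions only (a minimal sketch with made-up numbers):

from toolz import interleave

from_patterns = []   # no pattern matches found in this batch
from_model = [6, 7]
list(interleave((from_patterns, from_model)))
# [6, 7]  -> only the model's suggestions are shown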

A simple solution could be to increase the "batch_size", either in your recipe’s config or in your prodigy.json. Larger batches mean more potential for pattern matches. As a little sanity check, you might also want to run spaCy’s Matcher or PhraseMatcher over a portion of your corpus using the patterns you’ve created, just to verify that the corpus actually contains matches and that the matcher isn’t thrown off by different tokenization etc.
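A rough sketch of that sanity check, assuming token-based patterns stored in a patterns.jsonl file and spaCy v3’s Matcher API (the file names and the model name are placeholders, adjust them to your setup):

import spacy
import srsly
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # ideally the same base model you pass to ner.teach
matcher = Matcher(nlp.vocab)

# Each line in the patterns file looks like {"label": "DATE", "pattern": [...]}
for entry in srsly.read_jsonl("date_patterns.jsonl"):
    matcher.add(entry["label"], [entry["pattern"]])

total = 0
for eg in srsly.read_jsonl("my_corpus.jsonl"):
    doc = nlp(eg["text"])
    for match_id, start, end in matcher(doc):
        total += 1
        print(doc[start:end].text)

print("total matches:", total)

If this prints zero matches, the problem is in the patterns or the tokenization rather than in ner.teach. Note that string patterns meant for the PhraseMatcher would need the PhraseMatcher here instead of the Matcher.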

Makes sense. Thanks.