Training a new Pattern

Is there any empirical analysis of how many new samples are required, relative to the size of the corpus the model was trained on, to train a new pattern? As we deploy Prodigy models in production, this would be an important metric for ensuring certainty of outcomes. I will run some benchmarks on this, but wanted to understand whether there is any historical analysis from this perspective.

Assuming by pattern you mean entity type: There’s not really a way to say, because it depends on how hard the learning problem is. Some things to consider:

  • How common is the entity?
  • How diverse are the instances?
  • How ambiguous are the instances?

If you have an entity that’s made up of only a single word, and that word is common, and that word is always an entity, any model will learn this super quickly. On the other hand, if you’re trying to tag long phrases with huge surface variation, and whether the phrase is an entity depends on context, you’ll need a lot of examples.

The best advice we’ve been able to give is to plot out a dose/response curve of data vs accuracy. This is implemented in the ner.train-curve recipe.
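To see the idea outside of the recipe: a dose/response curve just means training on increasing fractions of your annotations and evaluating each model against the same held-out set. Below is a minimal, generic sketch of that loop — plain scikit-learn on a made-up toy dataset, not Prodigy's recipe — just to show the shape of the curve; you'd swap in your own exported annotations (e.g. via `db-out`).

```python
# Minimal dose/response (training data vs. accuracy) sketch.
# The toy texts/labels below are placeholders -- swap in your own annotations,
# e.g. exported from Prodigy. This uses plain scikit-learn, not Prodigy's
# recipe, purely to illustrate the idea.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

texts = [
    "loved the product", "excellent quality", "works perfectly", "very happy",
    "great value", "fantastic support", "highly recommend", "five stars",
    "broke after a week", "terrible quality", "waste of money", "very unhappy",
    "poor support", "would not recommend", "stopped working", "one star",
]
labels = ["POS"] * 8 + ["NEG"] * 8

# Hold out one fixed evaluation set so every point on the curve is comparable.
train_texts, eval_texts, train_labels, eval_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

for fraction in (0.25, 0.5, 0.75, 1.0):
    if fraction < 1.0:
        # Stratified subsample of the training annotations.
        subset_texts, _, subset_labels, _ = train_test_split(
            train_texts, train_labels, train_size=fraction,
            stratify=train_labels, random_state=0,
        )
    else:
        subset_texts, subset_labels = train_texts, train_labels

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(subset_texts)
    X_eval = vectorizer.transform(eval_texts)
    model = LogisticRegression(max_iter=1000).fit(X_train, subset_labels)
    accuracy = accuracy_score(eval_labels, model.predict(X_eval))
    print(f"{fraction:.0%} of training data ({len(subset_texts)} examples): "
          f"accuracy {accuracy:.3f}")
```

If accuracy is still climbing steeply at 100% of the data, more annotation is likely to pay off; if it has flattened out, you're probably near what the current setup can learn.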

By pattern here I meant text classification, not entity or term recognition.

Also, I see that spaCy is integrated with https://hazyresearch.github.io/snorkel/. It could be useful for creating a synthetic dataset using Snorkel.
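In case it helps anyone reading along, the rough shape of that workflow is: write a few labeling functions over unlabelled text, apply them, and let Snorkel's label model combine their noisy votes into training labels. A minimal sketch, assuming the pip-installable `snorkel` package's labeling API (0.9+) rather than the older HazyResearch release linked above; the label names, keywords, and toy data are made up for illustration:

```python
# Minimal sketch of weak supervision with Snorkel labeling functions.
# Labels, keywords, and the toy DataFrame are assumptions for illustration;
# see the Snorkel docs for the full workflow.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_positive_keywords(x):
    # Vote POSITIVE if an obviously positive keyword appears, otherwise abstain.
    return POSITIVE if any(w in x.text.lower() for w in ("love", "excellent", "great")) else ABSTAIN

@labeling_function()
def lf_negative_keywords(x):
    # Vote NEGATIVE if an obviously negative keyword appears, otherwise abstain.
    return NEGATIVE if any(w in x.text.lower() for w in ("broken", "refund", "terrible")) else ABSTAIN

df = pd.DataFrame({"text": [
    "I love this, excellent build quality",
    "Arrived broken, I want a refund",
    "Delivery took three days",
]})

# Apply the labeling functions to get a label matrix, then combine their
# noisy, overlapping votes into a single label per example.
applier = PandasLFApplier(lfs=[lf_positive_keywords, lf_negative_keywords])
L_train = applier.apply(df=df)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=0)
df["label"] = label_model.predict(L=L_train)  # -1 where all LFs abstained
print(df)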