Ballpark estimate of the number of annotations needed for custom NER training?

Hello! I'm new to the Explosion suite. Very impressive set of tools you guys have.

I'm currently comparing this suite to the one offered by Amazon for NER. Amazon lists some pretty hard requirements for NER transfer learning, at "200 samples per entity" (I am guessing that means per custom entity).

I noticed that Prodigy has a tool for estimating annotation requirements. Since it's based on your own dataset, it would obviously be more realistic than abstract numbers.

If an abstract number is at all possible to estimate, do you guys have one? Data requirements for annotation would be a big part of the decision as to which platform to use. If that kind of estimate would make no sense, just let me know :slight_smile:.

Thanks!

Alex

Hi and thanks! :slightly_smiling_face:

This question is very difficult to answer, because it totally depends on your use case and data, how you set up your label scheme (a bad label scheme can mean the model learns very little from a lot of data), and of course which model and pretrained weights you're using. Prodigy itself is pretty agnostic here, because you can set it up however you like and use any pretrained model to pre-label examples for you.

So I'm not sure if it makes sense to think of "number of training examples required" as a feature of an annotation tool or platform – at least, it's kinda weird, because the number of examples required depends on the model you're training and what you are doing with the data.
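
A rough way to find out for your own data is a learning curve: train on growing portions of what you've annotated and check whether the score is still improving. Here's a minimal sketch of that idea in plain Python. It is not Prodigy's implementation; `learning_curve`, `toy_train_and_score` and the data are all made up for illustration, and you'd swap in your real training and evaluation code.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (token, label): a toy stand-in for real NER annotations


def learning_curve(
    train_examples: Sequence[Example],
    eval_examples: Sequence[Example],
    train_and_score: Callable[[Sequence[Example], Sequence[Example]], float],
    fractions: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Train on growing portions of the data and record the score for each."""
    shuffled = list(train_examples)
    # Shuffle once so that each larger portion contains the smaller ones.
    random.Random(seed).shuffle(shuffled)
    results = []
    for fraction in fractions:
        n = max(1, int(len(shuffled) * fraction))
        results.append((n, train_and_score(shuffled[:n], eval_examples)))
    return results


def toy_train_and_score(train: Sequence[Example], dev: Sequence[Example]) -> float:
    """Hypothetical placeholder: memorise token -> label, guess "O" for unseen tokens."""
    lookup = dict(train)
    correct = sum(1 for token, label in dev if lookup.get(token, "O") == label)
    return correct / len(dev)


# Made-up data: 100 training tokens and 20 evaluation tokens.
train_data = [(f"tok{i}", "ORG" if i % 2 else "O") for i in range(100)]
eval_data = [(f"tok{i}", "ORG" if i % 2 else "O") for i in range(0, 100, 5)]

curve = learning_curve(train_data, eval_data, toy_train_and_score)
for n, score in curve:
    print(f"{n:>3} training examples -> accuracy {score:.2f}")
```

If the score is still climbing between the last two steps, more annotations will probably help. If it has flattened out, the label scheme or the model is more likely to be the bottleneck than the amount of data.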

We typically recommend doing at least a few hundred annotations, as this gives you enough data to run a meaningful evaluation (don't forget the evaluation data, it's just as important as the training data). Annotating data manually is pretty quick if you set it up efficiently and use a pretrained model to help with the labelling. In case you haven't seen it yet, this video shows a full end-to-end approach for training an NER model from scratch. The annotation part took only ~2.5 hours and I ended up with 949 annotations in total.
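
To give a feel for why the size of the evaluation set matters, here's a back-of-the-envelope sketch. It assumes a simple per-example accuracy and the normal approximation to a binomial proportion, which is a simplification (NER is usually scored with entity-level F-scores, and examples aren't fully independent). The function name `accuracy_margin` and the 80% figure are made up for illustration.

```python
import math


def accuracy_margin(accuracy: float, n_eval: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n_eval examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_eval)


# How much could a measured 80% accuracy be off, depending on evaluation set size?
for n_eval in (20, 100, 200, 500):
    margin = accuracy_margin(0.8, n_eval)
    print(f"{n_eval:>4} eval examples: 80% +/- {margin * 100:.1f} points")
```

Under these assumptions, a score measured on 100 evaluation examples is only good to within roughly 8 points either way, while 500 examples narrows that to roughly 3.5 points. With only a handful of examples, a change in the score can easily be noise.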

Thank you!