Recipe choice for NER Annotated Dataset Creation

Hi,

I am having trouble choosing my recipe based on https://prodi.gy/docs/named-entity-recognition#workflow and https://prodi.gy/prodigy_flowchart_ner-36f76cffd9cb4ef653a21ee78659d366.pdf.

The situation is:

  • Unannotated raw text (100s of documents with 50-100 entities per document)
  • Entity types/classes not covered by pretrained models (domain-specific entities, so I'm also concerned about out-of-vocabulary tokens)
  • Need to annotate the whole corpus of text efficiently

I thought about 2 ways I could do it:

  1. ner.manual -> ner.batch-train -> ner.teach
  2. ner.manual -> ner.batch-train -> ner.correct (rough command-line sketch below)
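To make option 2 concrete, here is roughly what I have in mind on the command line. This is only a sketch: the dataset name (ner_data), labels (LABEL_A, LABEL_B) and file paths are placeholders (docs.jsonl stands for the raw text as newline-delimited JSON with a "text" field per document), and the exact recipe names and arguments depend on the Prodigy version (older releases spell the training recipe ner.batch-train, newer ones use train).

  # 1) Fully annotate a first batch of documents by hand with the new labels
  prodigy ner.manual ner_data blank:en ./docs.jsonl --label LABEL_A,LABEL_B

  # 2) Train a first model on that manually annotated data
  prodigy train ner ner_data blank:en --output ./model_v1

  # 3) Let the model pre-annotate the remaining documents and correct its mistakes
  prodigy ner.correct ner_data ./model_v1 ./docs.jsonl --label LABEL_A,LABEL_B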

I am not sure which way would be the most efficient for annotating the whole dataset with 100% accuracy.

Any recommendations and help is welcome!

Thank you in advance.


Hi! Of course, it's difficult to give a definitive answer, because it all depends on your use case, data etc. But from what you describe, I'd personally lean towards starting with 2. That approach is similar to the one I'm showing in my NER video.

Some of the reasoning behind it:

  • If you're starting completely from scratch with new categories, it's always good to have a bit more gold-standard data, annotated with all entities. Both ner.manual and ner.correct give you that (there's a sketch of what such a record looks like right after this list).
  • After the first training session, it might still be unclear how good your model already is. ner.correct gives you a more straightforward look at the actual outputs here and you typically get a good feeling for what you're mostly correcting and where the problems are. (This is a lot more subtle if you're using a binary workflow and are annotating samples based on multiple possible analyses.)
  • You're saying your goal is to annotate the whole corpus of text? That's something you're not going to get if you're asking the model to pick what's relevant to annotate and skip what isn't (as would be the case with ner.teach). So ner.correct seems like a better compromise here, once you've pretrained a model to predict something: you can use the model to help you label, but you're still working through all examples.
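Just to illustrate what "gold-standard" means here: a fully annotated example in Prodigy's task format is stored as a JSON record with character-offset spans, roughly like the line below (the text, offsets and label are made up for illustration).

  {"text": "Some example text with entityname in it", "spans": [{"start": 23, "end": 33, "label": "LABEL_A"}], "answer": "accept"}

A binary workflow like ner.teach, by contrast, only records an accept/reject decision about one suggested span at a time, which is why it doesn't directly give you exhaustively annotated documents.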

As for annotating everything with 100% accuracy: that's a great and ambitious goal for data quality, but it's most likely unrealistic :stuck_out_tongue: You're going to make mistakes, and even if they're not actual mistakes, there are always going to be edge cases that are very difficult to annotate consistently.

Thank you for the very informative answer @ines!

"You're saying your goal is to annotate the whole corpus of text?"

Yes, exactly. My plan was to use ner.manual to fully annotate 5-10 documents, and then use the model trained on those documents to help me annotate the remaining ones in a more efficient way.
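And to keep the model useful as I go, I'd retrain on the growing dataset every so often and repeat the correction step. A rough sketch, again with placeholder names and assuming a Prodigy v1.9-style train recipe (the --eval-split flag holds out part of the data so I can see whether accuracy is actually improving):

  # retrain on everything annotated so far, holding out 20% for evaluation
  prodigy train ner ner_data blank:en --output ./model_v2 --eval-split 0.2

  # continue correcting suggestions with the updated model
  prodigy ner.correct ner_data ./model_v2 ./docs.jsonl --label LABEL_A,LABEL_B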

"once you've pretrained a model to predict something: you can use the model to help you label, but you're still working through all examples."

I am OK with going through all the documents and fixing annotations here and there, but I definitely don't want to annotate from scratch :grinning:

I will go with option #2 then, thank you!
