Most efficient way to train/label having separate manual and binary data


For a NER project with only 2 entities starting from scratch (not using existing entities) I have so far created 3 datasets:
1: 480 annotations using ner.manual of ONLY entity A
2: 300 annotations using ner.manual of ONLY entity B (only 80 sentences overlapped with 1)
3: 1150 binary annotation using ner.teach of ONLY entity B, using the model created by training on dataset 1 and 2 (all settings default without any flags)

Needless to say, with so few annotations I want to spend more time annotating. I also noticed binary annotating was significantly (>5x) faster than manual (even with patterns).

My questions are:

  • How should I proceed with labeling? It seems most time-efficient to collect more binary annotations using ner.teach for entity A, then more for B (but not both at the same time). Should they ideally be on the same sentences?
  • Can having such large separate binary datasets for A and B be the end point, or should I convert them to gold via silver-to-gold or ner.correct (which destroys the efficiency gain of binary annotations)?
  • How should I perform both final and intermediate training?
    ** Should I simply train the base (NER-less) model with all datasets (binary and manual), or first a model with the manual, then update that with the binary?
    ** Given that I want to perform binary annotations of only 1 entity at the same time, should I periodically retrain and switch entities to prevent the model becoming one-sided?
    ** Which flags should I use when? (I can at least choose -B, -NM, -E and a combination)

Sorry if these questions seem basic. I could figure out the answers for the simple case of a pure GOLD + binary dataset, but since my datasets are quite separate/different (no true GOLD exists, since datasets 1 and 2 are not overlapping) and I want to do binary labeling for A and B separately, the answers might be different.

Thanks a lot in advance!

Hi! These are all good questions and it's always good to consider these questions explicitly for each project :slightly_smiling_face:

The annotations don't necessarily have to be on the same sentences, although it's usually good to have at least some overlap. Otherwise, you can more easily end up with imbalanced data, and you'll also never have examples of texts with multiple different entities, which could mean that there's less useful information for the model to learn from, and more unknowns.

Binary annotations can definitely be very useful for moving your model in a better direction and correcting its mistakes, and as you say, they're really efficient to create and often include the specific examples that the model can get the most value out of.

That said, if you're training a new model from scratch, it's usually good to focus on a reasonably sized corpus of complete annotations as an end goal, and you can use your binary annotations to create an intermediate model to help with that. Instead of converting your binary annotations, another thing you could do is train a model using the data you already have, and then use it with ner.correct. If your model is pretty good already, this can also be extremely fast, because you only have to correct what the model gets wrong. So you can easily build up a very large corpus of complete, gold-standard annotations without having to do much manual labelling at all.
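The train-then-correct loop described above could look roughly like this (a sketch using Prodigy v1.10 syntax; the dataset names, model path and source file are hypothetical placeholders for your own):

```shell
# Train an intermediate model from the existing manual datasets
prodigy train ner dataset_a,dataset_b en_core_web_sm --output ./tmp_model

# Stream new texts through ner.correct and only fix what the model gets wrong
prodigy ner.correct gold_ab ./tmp_model ./new_texts.jsonl --label A,B
```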

In Prodigy v1.10, training from manual and binary annotations requires different logic, so you'll have to run the training separately. Ideally, you'd start with the manual annotations first, because those give you more complete information for the model to learn from, especially if you're starting from scratch.

In Prodigy v1.11 (currently available as a nightly pre-release), you'll be able to train from both manual and binary annotations jointly, and it's also something we'd recommend. You'll be able to get better results if your annotations include at least some complete examples, on top of the binary decisions.
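To make the difference concrete, here's a hedged sketch of what the two workflows might look like on the command line (dataset names like manual_a and binary_b are hypothetical placeholders):

```shell
# Prodigy v1.10: train from manual and binary annotations separately
prodigy train ner manual_a,manual_b en_core_web_sm --output ./model_manual
prodigy train ner binary_b ./model_manual --binary --output ./model_final

# Prodigy v1.11 (nightly): train from all datasets jointly
prodigy train ./model_out --ner manual_a,manual_b,binary_b
```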

When you're training the model, you should ideally train on all binary datasets together. Prodigy will take care of merging all annotations on the same input, so if a sentence contains binary annotations for two labels, the model will be updated with both pieces of information together.
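The merging idea can be illustrated with a toy sketch (this is not Prodigy's internal implementation, just an illustration of the principle): accepted binary spans on the same input text end up in one combined example.

```python
from collections import defaultdict

def merge_binary(examples):
    """Group accepted binary spans by their input text, so one sentence
    annotated separately for labels A and B yields a single merged example."""
    merged = defaultdict(list)
    for eg in examples:
        if eg["answer"] == "accept":
            merged[eg["text"]].extend(eg["spans"])
    return dict(merged)

# Three binary decisions on the same sentence: two accepts, one reject
annots = [
    {"text": "Acme hired Bob.", "answer": "accept",
     "spans": [{"start": 0, "end": 4, "label": "A"}]},
    {"text": "Acme hired Bob.", "answer": "accept",
     "spans": [{"start": 11, "end": 14, "label": "B"}]},
    {"text": "Acme hired Bob.", "answer": "reject",
     "spans": [{"start": 5, "end": 10, "label": "A"}]},
]
merged = merge_binary(annots)
print(merged["Acme hired Bob."])  # the two accepted spans, for labels A and B
```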

The --ner-missing flag is really only intended for non-binary annotations where you want to consider all unannotated tokens as "missing values" (as opposed to "not an entity", which is typically the default). This is already included when you train from --binary, because binary annotations always mean that you only know the answer for one particular token sequence, and nothing about all other unannotated tokens.
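The "missing" vs. "not an entity" distinction can be pictured with token-level BILUO-style tags, where "-" conventionally marks a missing value (a toy illustration, not Prodigy's internal representation):

```python
tokens = ["Acme", "hired", "Bob", "."]

# Complete (manual) annotation: unannotated tokens are known to be "O"
complete = ["U-A", "O", "U-B", "O"]

# Binary annotation accepting only the "Acme" span: everything else is unknown
binary = ["U-A", "-", "-", "-"]
```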

Hi Ines,
Thanks a lot for the detailed response! :slight_smile:

It seems from your response my best course of action would for example be:

  • Collect more (~1100) binary annotations for entity A, to make sure I have similarly sized datasets for A and B
  • Potentially update dataset 1 (manually annotated on only A) and add entity B annotations in those sentences to create more overlap, using review with -l "both,labels"
  • Train a model using above generated datasets (ideally using Prodigy 1.11 w/ combined binary and manual training).
  • Use the trained model with ner.correct to generate a larger corpus of gold annotations w/ entities A and B (hopefully the model is accurate enough to enable speedy labeling).
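The first and last steps of the plan above might look something like this (a sketch; the dataset names, model paths and source file are hypothetical placeholders):

```shell
# Step 1: collect binary annotations for entity A with the current model
prodigy ner.teach binary_a ./current_model ./texts.jsonl --label A

# Step 4: correct the retrained model's predictions to build a gold corpus
prodigy ner.correct gold_ab ./retrained_model ./texts.jsonl --label A,B
```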

I've just now applied for the nightly pre-release so look forward to seeing the new functionalities in action!
