Training a model on both gold and binary data

Hello,
I’ve just acquired Prodigy with the plan to improve data annotation for an already existing chatbot NLU app (NER as well as intent / text classification). Meaning, I started out with procedurally created gold/complete training data (~30 intents, ~10 entity types, some overlap with the pre-trained spaCy model (ORG, PERSON, LOC …), ~10000 examples) and train/fine-tune a spaCy NER model. I have then been using a custom annotation loop where I just take the lowest confidence user logs and laboriously correct the full annotations in a home-made sub-par interface. I was hoping to use Prodigy and its accept/reject approach to speed this up dramatically.

The more I look into the forum and documentation, it seems that training a model with both gold and binary training data isn’t properly supported? Using ner.batch-train, it seems I have to specify that the data is either one or the other, using the --no-missing flag. And there doesn’t seem to be any easy way to use spaCy with binary training data.

Is there any way to use both types of data effectively, or is this anywhere on the Prodigy roadmap?

You’re right that this isn’t well exposed at the recipe level at the moment. However, you can attach a no_missing field to your examples to indicate whether their annotations should be treated as gold-standard, i.e. complete, with nothing missing.

We’re not fully satisfied with the usability of the "no_missing" flags, but they were the best interim solution we could put in place. We hope this can be improved for Prodigy 2.0. I’ll give you a bit of backstory about how we’re thinking about this, and why the problem is a bit tricky.

We’ve opted to use a pretty minimal data model in Prodigy so far, to keep the task storage very flexible. The database views the annotation tasks as opaque blobs, which are associated with dataset IDs. Of course, there’s a trade-off here. The disadvantage of being able to store any type of data is that you don’t get a type system. A key piece of information we’re missing is the provenance of the examples and how complete their annotations are: we don’t really have a great way of denoting either.

The truth about what’s complete and correct can be much more subtle than just a binary flag. You could use NER annotations that are “complete” with respect to entity types A, B and C, but not some new entity type D. The underlying training algorithm could support this level of detail, but the implementation gets more complicated, and it gets harder to communicate what’s going on to users. Already the mechanism by which the model learns from the incomplete information is fairly subtle, so we’re reluctant to make it even more complicated.

Thanks for your response @honnibal, that all makes sense. I can see how you’d want flexibility on the “completeness” definition. I really think it would be a huge improvement though to be able to pass two datasets to the training recipe, one of which would be treated as complete (the most strict definition) and one treated as binary. While we’re there, it would actually be nice to be able to pass multiple datasets instead of having to merge them.

But ok, you’re saying that currently if I want to do this, I add a no_missing field to the complete examples like this? :

{
    "text": "Apple updates its analytics service with new metrics",
    "label": "CORRECT",
    "spans": [{
        "start": 0,
        "end": 5,
        "label": "ORG"
    }],
    "no_missing": true
}

That’s a reasonable workflow for the complete data that I already have and have to load into the Prodigy database anyway (so adding the field is a single line of code). It would be good to have a nice solution for using both gold and binary data that I’ve annotated in Prodigy, without having to extract the examples, add the field, and add them back to the database. But that’s not such a big deal; database manipulations aren’t that much work.
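A minimal sketch of that "single line of code" step, assuming the gold examples are already available as Python dicts in Prodigy's task format. The helper names mark_gold and to_jsonl are made up for illustration and are not part of Prodigy:

```python
import json


def mark_gold(examples):
    """Return copies of the examples with "no_missing": True added.

    The input dicts are left untouched, so the original data can be reused.
    """
    return [dict(eg, no_missing=True) for eg in examples]


def to_jsonl(examples):
    """Serialize examples as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(eg) for eg in examples)


gold = [
    {
        "text": "Apple updates its analytics service with new metrics",
        "spans": [{"start": 0, "end": 5, "label": "ORG"}],
    }
]

marked = mark_gold(gold)
print(to_jsonl(marked))
```

The printed JSONL can be written to a file and imported into a dataset, for example with the db-in command.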

Next task for me is to see whether I can reach the same accuracy when training my NER model in Prodigy as I do with spaCy. I took my old, complete, “accept” data, generated an equal amount of “reject” data, and chucked it into ner.batch-train, but accuracy was only around 70% (I had an F1 score of ~95% in spaCy, on slightly homogeneous generated data). But that’s something I should probably read more about and play around with before asking for help.

@honnibal @ines, could you please confirm whether this is the right syntax to specify an example as no_missing ?:

{
    "text": "Apple updates its analytics service with new metrics",
    "label": "CORRECT",
    "spans": [{
        "start": 0,
        "end": 5,
        "label": "ORG"
    }],
    "no_missing": true
}

Thanks!

Yes, that looks right to me. Does it seem to do the right thing when you try it?

Hi @honnibal, revisiting this topic from last year now that the CLI has changed quite a bit. Is it still possible to train with both binary (silver) and gold NER training data? I.e. by making sure there is a

'no_missing':true

field in the dataset/examples that are considered gold data, and then use the training command like this for example:

prodigy train ner gold_data,silver_data en_core_web_sm --binary --ner-missing

I'm actually quite confused by the --binary flag... isn't all the training data technically "binary", i.e. you either accept or reject the annotation? And isn't it the --ner-missing flag that specifies whether or not the annotation concerns only the span(s) with annotations?

(I have to admit I never properly tried using both silver and gold data last year, but now I really feel the need... I've created lots of gold data to serve as a foundation, and then I want to hand over further annotation to a less technical person, and it seems to me that ner.teach is the best way to do that now - kind of like starting from one of the pretrained spacy models and trying to improve accuracy on some entity types.)

The --binary flag uses a more complex mechanism to update the model that lets you take advantage of both the accepted and rejected suggestions of single entity spans – basically, like the default behaviour of ner.batch-train. I've written some more about it on this thread:

The --ner-missing flag lets you specify that unannotated tokens should be treated as missing values (and not explicitly as "not an entity"). This allows you to still train from incomplete annotations, e.g. if you only have annotations for one or two labels.
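To make the difference concrete, here is a small illustration of what "missing" versus "not an entity" means at the token level. It is only a sketch and not how Prodigy or spaCy implement it: it assumes whitespace tokenization and the BILUO convention in which "-" stands for a missing value and "O" for "outside any entity". The function names are hypothetical.

```python
import re


def whitespace_tokens(text):
    """Return (start, end) character offsets for whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def token_tags(text, spans, missing=True):
    """Build BILUO-style tags for the annotated spans.

    Unannotated tokens get "-" (unknown) if missing=True,
    otherwise "O" (explicitly not an entity).
    """
    tokens = whitespace_tokens(text)
    tags = ["-" if missing else "O"] * len(tokens)
    for span in spans:
        # Indices of tokens that fall fully inside the span
        covered = [
            i for i, (start, end) in enumerate(tokens)
            if start >= span["start"] and end <= span["end"]
        ]
        if not covered:
            continue
        label = span["label"]
        if len(covered) == 1:
            tags[covered[0]] = "U-" + label
        else:
            tags[covered[0]] = "B-" + label
            for i in covered[1:-1]:
                tags[i] = "I-" + label
            tags[covered[-1]] = "L-" + label
    return tags


text = "Apple updates its analytics service"
spans = [{"start": 0, "end": 5, "label": "ORG"}]

print(token_tags(text, spans, missing=True))   # ['U-ORG', '-', '-', '-', '-']
print(token_tags(text, spans, missing=False))  # ['U-ORG', 'O', 'O', 'O', 'O']
```

In the first case the model is told nothing about the remaining four tokens; in the second it is told that none of them is an entity, which is only safe if the annotation really is complete.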

The new train command was designed to harmonise training between Prodigy and spaCy, and use spaCy's regular update mechanism to train from gold-standard data (with and without missing values). This also makes it easier to ensure that results are consistent and reproducible.

OK think I got it, makes sense to have the flexibility to set --no-missing and --binary separately, even though in most cases I assume that ner.teach -> --binary and ner.make-gold -> --no-missing.

Regarding my main question above: Is it currently feasible to train a model with a mix of gold and binary training data? Would that be done in Prodigy 1.9.5 in the way I describe above, last message?

It's possible in theory, but could be less effective – especially if your gold data annotates all entities and includes no missing values. You wouldn't be able to take advantage of this when updating the model if you train in "binary mode" and assume that all unannotated tokens are missing. The results (stats, accuracy numbers) might also be harder to reason about, because updating from accepted and rejected spans means that Prodigy is also evaluating the predictions slightly differently. So doing it in two steps (pretrain on gold, update with binary) would be more straightforward.

Yes, I thought it might be difficult to efficiently use both types of data at once... pretraining on gold and updating with binary is worth trying - maybe even a custom training loop where gold and binary model updates are alternated (in case the binary data is a bit biased for example)? I'm looking into how I could modify the training script to do this, but do you think of any immediate caveats with this? I understand it may take some balancing of how much to train with the gold vs the binary data, but is there any issue with alternating between the different training objectives repeatedly? I guess I would keep the evaluation dataset (gold) fixed.

It might be tricky, because the binary updating is more complex and happens via Prodigy's EntityRecognizer annotation model, whereas the "regular" updating just calls into spaCy's API directly. In theory, you could probably rewrite the training loop to use the regular nlp.update mechanism for the gold data and the annotation model for the binary data. But you'd have to do this in two steps... at least, I can't think of a good way to make it work for mixed batches of gold/binary.
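A rough sketch of what such a two-step loop could look like, with the gold and binary phases kept in separate batches. Everything here is hypothetical scaffolding: update_gold and update_binary are placeholder callbacks that you would have to implement yourself (for instance wrapping nlp.update and the annotation model's update, respectively), and the split assumes gold examples carry a no_missing field as discussed earlier in the thread.

```python
def split_by_completeness(examples):
    """Separate examples flagged "no_missing" (gold) from the rest (binary)."""
    gold = [eg for eg in examples if eg.get("no_missing")]
    binary = [eg for eg in examples if not eg.get("no_missing")]
    return gold, binary


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def train_alternating(examples, update_gold, update_binary,
                      n_rounds=3, batch_size=8):
    """Alternate between a gold phase and a binary phase in each round.

    Batches are never mixed: each batch goes to exactly one update callback.
    """
    gold, binary = split_by_completeness(examples)
    for _ in range(n_rounds):
        for batch in batches(gold, batch_size):
            update_gold(batch)      # hypothetical: regular gold-standard update
        for batch in batches(binary, batch_size):
            update_binary(batch)    # hypothetical: accept/reject update


# Usage with stub callbacks that only record what they were given
calls = []
data = [
    {"text": "a", "no_missing": True},
    {"text": "b", "answer": "reject"},
    {"text": "c", "no_missing": True},
]
train_alternating(
    data,
    update_gold=lambda batch: calls.append(("gold", len(batch))),
    update_binary=lambda batch: calls.append(("binary", len(batch))),
    n_rounds=2,
)
print(calls)
```

How many rounds to run, and how to weight the two phases against each other, would need to be tuned against a fixed gold evaluation set.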