Training a model on both gold and binary data

Thanks for your response @honnibal, that all makes sense. I can see how you'd want flexibility on the "completeness" definition. I do think it would be a big improvement, though, to be able to pass two datasets to the training recipe, one treated as complete (the strictest definition) and one treated as binary. While we're at it, it would actually be nice to be able to pass multiple datasets instead of having to merge them first.

But OK, you're saying that currently, if I want to do this, I add a no_missing field to the complete examples, like this:

{
    "text": "Apple updates its analytics service with new metrics",
    "label": "CORRECT",
    "spans": [{
        "start": 0,
        "end": 5,
        "label": "ORG"
    }],
    "no_missing": true
}

That's a reasonable workflow for the complete data that I already have and need to load into the Prodigy database anyway (so adding the field is a single line of code). It would still be good to have a nicer solution for combining gold and binary data that I've annotated in Prodigy, without having to extract the examples, add the field, and write them back to the database. But that's not a big deal; database manipulations aren't much work.
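
For reference, this is roughly the manipulation I have in mind, just a minimal sketch using the connect() / get_dataset() / add_examples() / add_dataset() calls from the Prodigy database API, with made-up dataset names ("ner_gold", "ner_gold_complete"):

from prodigy.components.db import connect

# Connect using the database settings from prodigy.json
db = connect()

# Pull the existing gold examples, mark each one as complete,
# then write them into a fresh dataset for training.
examples = db.get_dataset("ner_gold")
for eg in examples:
    eg["no_missing"] = True

db.add_dataset("ner_gold_complete")
db.add_examples(examples, datasets=["ner_gold_complete"])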

Next task for me is to see whether I can reach the same accuracy when training my NER model in Prodigy as I do with spaCy. I took my old, complete, "accept" data, generated an equal amount of "reject" data, and ran it through ner.batch-train, but accuracy came out at only around 70% (I had an F1 score of ~95% in spaCy, though on somewhat homogeneous generated data). But that's something I should probably read more about and play around with before asking for help.
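
In case it's relevant, the command was roughly along these lines (dataset name, base model and output path are placeholders, and I'm reconstructing the flags from memory):

prodigy ner.batch-train ner_accept_reject en_core_web_lg --label ORG --eval-split 0.2 --n-iter 10 --output /tmp/ner_model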