Mixing in gold data to avoid catastrophic forgetting

This is actually a nice idea – maybe Prodigy should expose a helper function for this. The conversion is pretty straightforward, because the annotation tasks already include everything you need. You can also access the Prodigy database from Python, so you won't even have to export your dataset to a file. The (untested) code below should give you spaCy training data in the "simple training style", i.e. with annotations provided as a dictionary:

from prodigy.components.db import connect

def dataset_to_train_data(dataset):
    db = connect()  # connect to the database using the settings from prodigy.json
    examples = db.get_dataset(dataset)  # load the annotated examples
    train_data = []
    for eg in examples:
        if eg['answer'] == 'accept':
            # an example may have no 'spans' key if no entities were annotated
            entities = [(span['start'], span['end'], span['label'])
                        for span in eg.get('spans', [])]
            train_data.append((eg['text'], {'entities': entities}))
    return train_data

The above example will just export each entry in the dataset as a separate training example. If you want to reconcile spans on the same input that were separated into individual tasks, you can use the input hash:

from prodigy.util import INPUT_HASH_ATTR # == '_input_hash', but using the constant is nicer

You can then check eg[INPUT_HASH_ATTR] and merge the spans of annotated examples that refer to the same input.
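For example, merging could look something like the (untested) sketch below. The helper name `merge_spans_by_input` is made up for illustration, and the constant is inlined as `'_input_hash'` so the snippet is self-contained – in real code you'd import `INPUT_HASH_ATTR` from `prodigy.util` as shown above:

```python
from collections import defaultdict

INPUT_HASH_ATTR = '_input_hash'  # same value as prodigy.util.INPUT_HASH_ATTR

def merge_spans_by_input(examples):
    # group accepted examples by their input hash, collecting all spans
    merged = defaultdict(lambda: {'text': None, 'entities': set()})
    for eg in examples:
        if eg.get('answer') != 'accept':
            continue
        entry = merged[eg[INPUT_HASH_ATTR]]
        entry['text'] = eg['text']
        for span in eg.get('spans', []):
            entry['entities'].add((span['start'], span['end'], span['label']))
    # one training example per unique input, with all its spans combined
    return [(entry['text'], {'entities': sorted(entry['entities'])})
            for entry in merged.values()]
```

Using a set for the entities also deduplicates identical spans that were accepted more than once for the same input.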
