I've got the same dataset annotated in multiple sessions, one session per entity type. Now I'd like to merge the data so that there's one dataset whose spans include all the entities found, without duplicating the text. Following "Merging annotations from different datasets", I hoped that this would do it:
#!/usr/bin/python3
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # uses the settings in your prodigy.json
datasets = ['ner-test-PRODUCT',
            'ner-test-LOCATION']  # names of your datasets
merged_examples = []
for dataset in datasets:
    examples = db.get_dataset(dataset)
    merged_examples += examples
# recompute the hashes so duplicates can be detected
merged_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('ner-test-merged')  # however you want to call the new set
db.add_examples(merged_examples, datasets=['ner-test-merged'])
but the resulting dataset looks like a simple concatenation. Each text is duplicated, once for each input dataset. The task_hash is common for the duplicates, but the input_hash is not. Is there something I can do other than export the data and merge the spans with my own code?
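In case it helps frame the question, this is roughly the manual merge I'm hoping to avoid: read back the concatenated set, group the examples by their text, union the spans, and re-add the result. It's only a minimal sketch under my assumptions (each example is a dict with "text" and "spans" keys, as in Prodigy's JSONL format), and 'ner-test-merged-spans' is just a made-up name for the output set:

#!/usr/bin/python3
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()
examples = db.get_dataset('ner-test-merged')  # the concatenated set from above

# Group the duplicated examples by their text, keeping the first copy
# as the base so keys like "answer" are preserved.
by_text = {}
for eg in examples:
    if eg['text'] not in by_text:
        by_text[eg['text']] = dict(eg, spans=[])
    by_text[eg['text']]['spans'].extend(eg.get('spans', []))

merged = []
for eg in by_text.values():
    # Drop spans that were annotated in more than one session,
    # then sort the remainder by character offset.
    unique = {(s['start'], s['end'], s['label']): s for s in eg['spans']}
    eg['spans'] = sorted(unique.values(), key=lambda s: s['start'])
    merged.append(set_hashes(eg, overwrite=True))

db.add_dataset('ner-test-merged-spans')  # hypothetical name for the result
db.add_examples(merged, datasets=['ner-test-merged-spans'])

It works, but I'd much rather use a built-in way to do this if one exists.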
Thanks!