I've got the same dataset annotated in multiple sessions, one session per entity type. Now I'd like to merge the data so that there's one dataset whose spans include all the entities found, without duplicating the text. Following "Merging annotations from different datasets", I hoped that
```python
#!/usr/bin/python3
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # uses the settings in your prodigy.json
datasets = ['ner-test-PRODUCT', 'ner-test-LOCATION']  # names of your datasets

merged_examples = []
for dataset in datasets:
    examples = db.get_dataset(dataset)
    merged_examples += examples

merged_dataset = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('ner-test-merged')  # however you want to call the new set
db.add_examples(merged_dataset, datasets=['ner-test-merged'])
```
would do it, but the resulting dataset looks like a simple concatenation: each text is duplicated, once per input dataset. The `_task_hash` is shared between the duplicates, but the `_input_hash` is not. Is there something I can do other than export the data and merge the spans with my own code?
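For reference, by "merge the spans with my own code" I mean something like the following plain-Python sketch (`merge_by_text` is my own hypothetical helper, keyed on the raw text rather than the input hash, and the two sample tasks are made up):

```python
def merge_by_text(datasets):
    """Merge Prodigy-style task dicts from several datasets into one list.

    `datasets` is a list of lists of task dicts. Examples with the same
    "text" are collapsed into one task; spans with the same
    (start, end, label) are deduplicated.
    """
    merged = {}
    for examples in datasets:
        for eg in examples:
            key = eg["text"]
            if key not in merged:
                # first time we see this text: copy the task and its spans
                merged[key] = dict(eg, spans=list(eg.get("spans", [])))
            else:
                # text already seen: append only spans we don't have yet
                seen = {(s["start"], s["end"], s["label"])
                        for s in merged[key]["spans"]}
                for span in eg.get("spans", []):
                    sig = (span["start"], span["end"], span["label"])
                    if sig not in seen:
                        merged[key]["spans"].append(span)
                        seen.add(sig)
    return list(merged.values())

# One text annotated for PRODUCT in one session, LOCATION in another:
products = [{"text": "Apple opened a store in Paris",
             "spans": [{"start": 0, "end": 5, "label": "PRODUCT"}]}]
locations = [{"text": "Apple opened a store in Paris",
              "spans": [{"start": 24, "end": 29, "label": "LOCATION"}]}]
merged = merge_by_text([products, locations])
# one task carrying both spans, instead of two duplicated tasks
```

The merged list could then be re-hashed with `set_hashes(eg, overwrite=True)` and written to a new dataset, but I'd rather not maintain this myself if Prodigy already has a built-in way.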