using merge_spans to combine manual NER spans of different entities in different sessions

I've got the same dataset annotated in multiple sessions, one session per entity. Now I'd like to merge the data so that there's one dataset whose spans include all the entities found without duplicating the text. Following Merging annotations from different datasets, I hoped that

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # uses the settings in your prodigy.json
datasets = ['ner-test-PRODUCT',
            'ner-test-LOCATION']  # names of your datasets

merged_examples = []
for dataset in datasets:
    examples = db.get_dataset(dataset)
    merged_examples += examples

merged_dataset = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('ner-test-merged')  # whatever you want to call the new set
db.add_examples(merged_dataset, datasets=['ner-test-merged'])

but the resulting dataset looks like a simple concatenation. Each text is duplicated, once per input dataset. The task_hash is common for the duplicates, but the input_hash is not. Is there something I can do other than export the data and merge the spans with my own code?


Yes, that's correct – in your code, you're only concatenating the lists of examples, so the result will just be a list of all examples. It's equivalent to what's now also available as the built-in db-merge command. Internally, Prodigy will merge examples before training (e.g. with train) or when you export the data with data-to-spacy.

If you want to do it yourself in a script or some other process, you can use the _input_hash to determine whether annotations refer to the same text and use that to group the annotations together. In the most straightforward scenario, you'd keep one example per input hash and then concatenate all the spans of accepted (not rejected or ignored) examples.
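To illustrate that straightforward scenario, here's a minimal sketch of a merge script. The function name `merge_by_input_hash` is my own, and it assumes each example dict has the usual `_input_hash`, `answer` and `spans` keys that Prodigy writes out:

```python
def merge_by_input_hash(examples):
    """Keep one example per _input_hash and concatenate the spans
    of all accepted examples that share that hash."""
    merged = {}
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored annotations
        key = eg["_input_hash"]
        if key not in merged:
            # copy so we don't mutate the original example in place
            merged[key] = {**eg, "spans": list(eg.get("spans", []))}
        else:
            merged[key]["spans"].extend(eg.get("spans", []))
    return list(merged.values())


examples = [
    {"_input_hash": 1, "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "PRODUCT"}]},
    {"_input_hash": 1, "answer": "accept",
     "spans": [{"start": 10, "end": 16, "label": "LOCATION"}]},
    {"_input_hash": 2, "answer": "reject",
     "spans": [{"start": 0, "end": 3, "label": "PRODUCT"}]},
]
merged = merge_by_input_hash(examples)
# one example left, carrying both the PRODUCT and the LOCATION span
```

You'd feed it the concatenated examples from your datasets and then add the result to a fresh dataset with db.add_examples.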

One thing to consider is how to handle conflicts. A token can only be part of one entity, and in an ideal scenario, there should be no conflicts between your different entity labels. But in theory, there could be. So you could either adopt a policy for that (e.g. always accept the longest span, or prefer label A over label B), or you can flag those examples and resolve the conflicts manually.
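As one example of such a policy, here's a sketch that resolves overlaps by always keeping the longest span (the helper name and the "longest wins" rule are just illustrative choices, not anything built into Prodigy):

```python
def resolve_overlaps(spans):
    """Drop any span that overlaps an already-kept longer span.
    Policy: longest span wins; ties broken by sort order."""
    # sort longest-first so longer spans claim their characters first
    candidates = sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True)
    kept = []
    for span in candidates:
        no_overlap = all(
            span["end"] <= k["start"] or span["start"] >= k["end"]
            for k in kept
        )
        if no_overlap:
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])


spans = [
    {"start": 0, "end": 10, "label": "PRODUCT"},
    {"start": 5, "end": 8, "label": "LOCATION"},  # conflicts with the span above
]
resolved = resolve_overlaps(spans)
# only the longer PRODUCT span survives
```

You'd run this over each merged example's "spans" list; for a "flag and review" policy, you'd instead collect the examples where resolve_overlaps removes anything and re-annotate those.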