Merging datasets of same input data to combine separately annotated entities

Hi,

I've annotated the same 5,000 texts twice, as two separate datasets. In each pass I annotated a different, non-overlapping entity type.

I now want to merge the datasets into a combined set of 5,000 texts with both entity labels applied.

When I use db-merge, the resulting dataset seems to just be the 5,000 texts x 2, i.e. the second dataset has simply been appended to the first.

I assumed the input hashes might have changed somewhere between annotating the two sets, so I tried db-out'ing the datasets, db-in'ing them under new names with -R to force a rehash, and then combining them. This did not work either.

Is there a way to combine my datasets without having to manually add the second entity label to one of the labelled sets?

FYI, I have six entities that I would like to apply in total, but having completed the first two I wanted to test the theory before continuing.

Thanks in advance!

Darren :slightly_smiling_face:

Hi @DGMS90 ,

The db-merge command just concatenates the datasets. It does not merge annotated spans that share the same input_hash, so what you're observing is expected.

Prodigy only merges annotations on the same input at two points: before training with train, and when exporting with data-to-spacy. These two commands also take care of resolving conflicting annotations, e.g. overlapping spans, by selecting the longer one. So you might as well store your datasets separately and only merge when you're ready to train.

You can of course merge it yourself, if that's preferred. In this post Ines provides some code snippets for this that should be helpful, and there are some more relevant comments here.
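
If you do want to merge manually, here is a minimal sketch of the idea (this is not the code from the linked posts). It assumes you've exported each dataset to JSONL with db-out, that every record carries an `_input_hash` and a `spans` list with `start`, `end` and `label`, and that your entity types never overlap. The helper names `merge_by_input_hash` and `read_jsonl` are made up for illustration.

```python
import json


def read_jsonl(path):
    """Load a JSONL export (one JSON record per line) into a list of dicts."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_by_input_hash(*datasets):
    """Combine the spans of all records that share the same _input_hash."""
    merged = {}  # dicts keep insertion order, so the original order is preserved
    for examples in datasets:
        for eg in examples:
            key = eg["_input_hash"]
            if key not in merged:
                # Copy the record so the caller's data is not mutated.
                merged[key] = {**eg, "spans": []}
                # Assumption: the old task hash described the single-label
                # annotation, so drop it and let Prodigy rehash on import.
                merged[key].pop("_task_hash", None)
            target = merged[key]["spans"]
            seen = {(s["start"], s["end"], s["label"]) for s in target}
            for span in eg.get("spans", []):
                ident = (span["start"], span["end"], span["label"])
                if ident not in seen:  # skip exact duplicates
                    target.append(dict(span))
                    seen.add(ident)
    for eg in merged.values():
        eg["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())
```

You'd call it as `merge_by_input_hash(read_jsonl("entity_a.jsonl"), read_jsonl("entity_b.jsonl"))` (file names hypothetical), write the result back out as JSONL and import it with db-in. Note that this does not resolve overlapping spans or check the `answer` field, so it's only safe if the labels really are non-overlapping and you filter out rejected or ignored examples first.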

Thank you so much! :pray: