Merging datasets of same input data to combine separately annotated entities

DGMS90 · February 14, 2025, 12:02pm

Hi,

I've annotated two identical datasets containing 5,000 texts. I annotated a different non-overlapping entity type on each respective iteration of the 5,000 texts.

I'm now wanting to merge the datasets to have a combined set of 5,000 texts with both entity labels applied.

When I use db-merge, the resulting dataset seems to just be the 5,000 texts x 2 i.e. it's been appended to itself.

I assumed the input hashes might have changed somewhere between annotating the two sets, so I tried exporting the datasets with db-out, re-importing them with db-in under a new name with -R applied to force a rehash, and then combining them. This did not work either.

Is there a way to combine my datasets without having to manually add the second entity label to one of the labelled sets?

FYI, I have six entities that I would like to apply in total, but having completed the first two I wanted to test the theory before continuing.

Thanks in advance!

Darren

magdaaniol · February 14, 2025, 4:02pm

Hi @DGMS90 ,

The db-merge command actually just concatenates the datasets. It does not merge the annotated spans on examples that share the same input hash, so what you're observing is the expected behaviour.

Prodigy only merges the annotations when training with train or exporting with data-to-spacy. These two commands also take care of resolving conflicting annotations, e.g. overlapping spans, by selecting the longer one. So you might as well store your datasets separately and only merge when you're ready to train.

You can of course merge them yourself, if that's preferred. In this post Ines provides some code snippets for this that should be helpful, and there are some more relevant comments here.
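
For illustration, here is a minimal sketch of what such a manual merge could look like. It is not the code from the linked posts. It assumes each dataset has been exported to a list of dicts (for example by reading the JSONL from db-out), that every example carries an `_input_hash` and an optional `spans` list, and that the entity types don't overlap. The function name `merge_spans_by_input` and the sample data are made up for the example.

```python
import copy


def merge_spans_by_input(*datasets):
    """Combine spans from several annotation passes over the same texts.

    Hypothetical helper: examples are grouped by "_input_hash" and their
    spans are pooled. Overlapping spans are NOT resolved here.
    """
    merged = {}
    for examples in datasets:
        for eg in examples:
            key = eg["_input_hash"]
            if key not in merged:
                # First time we see this text: keep a copy as the base example.
                merged[key] = copy.deepcopy(eg)
                merged[key].setdefault("spans", [])
                continue
            target = merged[key]["spans"]
            seen = {(s["start"], s["end"], s["label"]) for s in target}
            for span in eg.get("spans", []):
                ident = (span["start"], span["end"], span["label"])
                if ident not in seen:  # skip exact duplicates
                    target.append(dict(span))
                    seen.add(ident)
    for eg in merged.values():
        eg["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


# Made-up sample data: the same text annotated in two separate passes.
text = "Ada met Bob in Paris."
persons = [{"text": text, "_input_hash": 1, "spans": [
    {"start": 0, "end": 3, "label": "PERSON"},
    {"start": 8, "end": 11, "label": "PERSON"},
]}]
places = [{"text": text, "_input_hash": 1, "spans": [
    {"start": 15, "end": 20, "label": "GPE"},
]}]

combined = merge_spans_by_input(persons, places)
```

The sketch ignores the `answer` field, so you may want to filter out rejected or ignored examples first. Because the spans change, the task hashes of the merged examples will no longer match their content. If I remember correctly, Prodigy's `set_hashes` helper with `overwrite=True` can recompute them before you load the result back in with db-in, but please check the docs for your version.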

DGMS90 · February 17, 2025, 8:58am

Thank you so much!
