I’m training a NER model with several entity types. I tagged each entity type as a separate task, so I now have several versions of my dataset, each annotated with a different entity type.
I want to merge the datasets so I have a single dataset with all the annotations. However, running `prodigy db-merge` concatenates the datasets rather than merging them.
When I train a NER model on the resulting dataset, Prodigy drops most of the samples and creates a model with only the first NER tag found.
Is this expected behaviour?
Welcome to the forum @jhandsel!
Indeed, the `db-merge` command just concatenates the datasets; it does not merge the annotated spans of examples with the same `_input_hash`, so what you're observing after running `db-merge` is correct.
Prodigy only merges the annotations before training with `train` and exporting with `data-to-spacy`. These two commands also take care of resolving conflicting annotations, e.g. overlapping spans, by selecting the longer span. So you might as well store your datasets separately and only merge them when you're ready to train.
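In fact, both `train` and `data-to-spacy` accept multiple datasets directly, so you can pass all your per-entity datasets at once, e.g. `prodigy train ./output --ner dataset_one,dataset_two,dataset_three` (using your dataset names, of course).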
Now, the reason why some annotations are being ignored could be that the annotations produced in different annotation rounds resulted in overlapping spans, which is not allowed in NER (each token can only belong to one entity). If you need custom conflict resolution logic, you'd need to merge the spans via a custom function. For this, you want to process your merged dataset by grouping examples by `_input_hash` and merging all spans into a single list:
```python
from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ["dataset_one", "dataset_two", "dataset_three"]
examples = []
for dataset in datasets:
    examples += db.get_dataset_examples(dataset)  # get examples from the database

# group examples by _input_hash and combine their spans into a single list
merged_examples = merge_spans(examples)
```
Now you could process the merged examples by trying to create spaCy entities from the span annotations; if there's a conflict, spaCy will raise an error. You can use a function similar to this one:
```python
def check_span_conflicts(example, nlp):
    doc = nlp.make_doc(example["text"])
    # Create spaCy spans from Prodigy spans
    spans = []
    for span in example.get("spans", []):
        spacy_span = doc.char_span(span["start"], span["end"], label=span["label"])
        if spacy_span:
            spans.append(spacy_span)
        else:
            raise ValueError(
                f"Span could not be created. Span offsets are misaligned with the "
                f"tokenization in example with input hash {example.get('_input_hash')}"
            )
    # Try to set entities - this will fail if there are conflicts
    try:
        doc.set_ents(spans)
    except Exception:
        raise ValueError(
            f"Conflicting spans detected in example with input hash "
            f"{example.get('_input_hash')}"
        )
```
Once you've detected the problematic examples, you can add your own logic for custom conflict resolution.
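As an illustration only (this is not a built-in Prodigy function), one simple strategy would be to mimic what `train` does and keep the longest span whenever two spans overlap:

```python
def resolve_overlaps(example):
    """Keep only the longest span among any group of overlapping spans.
    Sketch of one possible resolution strategy, not a Prodigy API."""
    # sort longest-first so preferred spans are considered before shorter rivals
    spans = sorted(
        example.get("spans", []),
        key=lambda s: s["end"] - s["start"],
        reverse=True,
    )
    kept = []
    for span in spans:
        overlaps = any(
            span["start"] < other["end"] and other["start"] < span["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(span)
    example["spans"] = sorted(kept, key=lambda s: s["start"])
    return example
```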
If what you need is overlapping spans, you might consider span categorization instead.
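For annotating overlapping spans you could use the `spans.manual` recipe and then train a span categorizer (e.g. with `prodigy train --spancat`), which doesn't have the one-entity-per-token restriction.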
Thanks for confirming the behaviour of `db-merge`.
There aren't many cases of overlapping spans in my annotations, so I'm not sure why `train` failed to auto-merge. I am working with Japanese data; could this be an issue?
I've noticed that other components don't behave as expected with Japanese. For example, the `review` interface sometimes considers identically annotated examples as distinct. And the output of `print-dataset` has sentences chopped and shifted, with some tags missing.
As a workaround, I managed to combine my datasets using a script similar to the one you provided:
```python
from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ["dataset_one", "dataset_two", "dataset_three"]
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)
merged_examples = merge_spans(examples)
rehashed_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset("merged_dataset")
db.add_examples(rehashed_examples, datasets=["merged_dataset"])
```
I was able to train a model on the output, and it included all of the expected NER categories.
No, as long as the tokenizer used during annotation is the same as the one used by the `train` command, the language should not matter. Also, it's not only overlapping spans that would be a problem for NER: misaligned spans (span offsets not corresponding to the tokenization) would also be rejected.
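To illustrate the kind of misalignment that can happen (a hypothetical sketch; the text, offsets and label are made up, and `spacy.blank("ja")` requires SudachiPy):

```python
import spacy

nlp = spacy.blank("ja")  # Japanese tokenizer (requires SudachiPy)
doc = nlp.make_doc("東京都に住んでいます")
# A span whose character offsets fall inside a token cannot be aligned
# with the tokenization, so char_span returns None and the annotation
# would be rejected during training
span = doc.char_span(4, 5, label="X")  # may fall mid-token, depending on the tokenizer
print(span)  # None if the offsets don't match token boundaries
```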
Perhaps you were using `ner.manual` with the `--highlight-chars` option? That could definitely have introduced misaligned spans, as it doesn't retokenize the example (as per the docs here).
> And the output of `print-dataset` has sentences chopped and shifted, with some tags missing.
`print-dataset` prints exactly what's in the spans; it should even print overlapping and misaligned spans correctly. Could you provide an example of such a "chopped and shifted" example with "tags missing"? The source dataset entry and a screenshot of how it looks when printed with `print-dataset` would be great, thank you.
> For example, the `review` interface sometimes considers identically annotated examples as distinct.
I'm afraid I can't reproduce this problem either. I suspect there must be differences in the spans, such as inclusion/exclusion of whitespace, that are not easily visible. Since it seems to be a data-related issue, again, an example of an identical annotation that renders incorrectly in `review` would be great.