Merging annotations from different datasets

Each example that comes in is assigned an _input_hash – for text-based tasks, that hash is generated from the "text" property. So you can easily check whether annotations refer to the same input text by comparing their input hashes, and then merge their "spans".
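To make the idea concrete, here is a minimal sketch in plain Python that doesn't rely on any Prodigy internals. It assumes each example is a dict with an "_input_hash" and an optional "spans" list whose entries have "start", "end" and "label" keys. The name merge_by_input_hash is made up for illustration, and the sketch ignores the "answer" field, so you'd probably want to filter out rejected examples first.

```python
import copy


def merge_by_input_hash(examples):
    """Group examples by "_input_hash" and combine their "spans".

    Hypothetical helper, not part of Prodigy. Duplicate spans (same start,
    end and label) are only kept once.
    """
    merged = {}  # input hash -> merged example (dicts keep insertion order)
    seen = {}    # input hash -> set of (start, end, label) already added
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            # Copy the first example for this input, but rebuild its spans
            merged[key] = copy.deepcopy(eg)
            merged[key]["spans"] = []
            seen[key] = set()
        for span in eg.get("spans", []):
            ident = (span["start"], span["end"], span.get("label"))
            if ident not in seen[key]:
                seen[key].add(ident)
                merged[key]["spans"].append(dict(span))
    for eg in merged.values():
        # Keep spans in document order
        eg["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


# Toy data: two annotations of the same text, plus one unrelated example
examples = [
    {"text": "Apple hires Tim", "_input_hash": 1,
     "spans": [{"start": 12, "end": 15, "label": "PERSON"}]},
    {"text": "Apple hires Tim", "_input_hash": 1,
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Hello", "_input_hash": 2, "spans": []},
]
merged = merge_by_input_hash(examples)
```

Note that this simply unions the spans, so overlapping or conflicting spans from different datasets are kept as they are and would need to be resolved separately.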

The NER model comes with a helper function merge_spans that should do exactly that – however, it's currently only used internally during updating and we haven't really tested its isolated usage yet. But you can try the following:

from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ['dataset_one', 'dataset_two', 'dataset_three']
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)  # get examples from the database

merged_examples = merge_spans(examples)

merged_examples should now contain the examples, with their spans merged wherever the input is identical. You can then inspect the list, save it out to a file, or even add it back to the database. To keep things clean, you probably also want to rehash the examples and assign new input and task hashes based on the new properties.

from prodigy import set_hashes

# rehash so the input and task hashes reflect the merged spans
merged_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('merged_dataset')  # create the new dataset
db.add_examples(merged_examples, datasets=['merged_dataset'])

Instead of fetching the examples from the database in Python, you could also use the db-out command to download them as JSONL and then read them in. You can also use the ner.print-dataset recipe to preview the sets on the command line.
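If you go the db-out route, reading the exported file back in only needs the standard library. Here's a rough sketch, assuming you've redirected the output to a file first (e.g. prodigy db-out your_dataset > exported_dataset.jsonl). The read_jsonl helper and the file name are just placeholders for illustration.

```python
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Example usage (file name is a placeholder):
# examples = list(read_jsonl("exported_dataset.jsonl"))
```

The resulting list of dicts can then be used in place of the examples fetched via db.get_dataset.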

The easiest way would probably be to export your dataset as JSONL and load it into a recipe like ner.make-gold.

prodigy ner.make-gold your_new_dataset en_core_web_sm exported_dataset.jsonl --label NEW_ENTITY

Since the recipe also supports reading in a stream from stdin, and db-out writes to stdout if no output file is specified, you should also be able to pipe the output forward, for extra convenience (untested, but should work):

prodigy db-out your_dataset | prodigy ner.make-gold your_new_dataset en_core_web_sm --label NEW_ENTITY

Important note: When using ner.teach, keep in mind that the selection of examples for each entity will be biased based on the model's predictions. This is good, because it helps you annotate only the most relevant examples. But it also means that the examples selected for one entity type might not necessarily be the best examples to annotate for a different entity type. So I'd only recommend the above approach for non-active-learning recipes like ner.make-gold or ner.manual.

I'm not sure I understand your question correctly – could you give an example?
