Merging annotations from different datasets


(Michael Higgins) #1

As a newb, I annotated 3 datasets for three different entities, street, zip and state. The datasets have shared text. I would like to now merge them so they can learn from the other labels. Is there an easy way to go about this?

Does ner.teach and ner.match provide samples from the db and from the input jsonl file?
How can I make it so only samples from the db are given to annotate (with a new entity)?

(Ines Montani) #2

Each example that comes in is assigned an _input_hash – for text-based tasks, that hash is generated from the "text" property. So you can easily check whether annotations refer to the same input text by comparing their input hashes, and then merging their "spans".
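In plain Python, that hash-based merging could be sketched like this (a hand-rolled illustration of the idea, not Prodigy's internal implementation):

```python
def merge_by_input_hash(examples):
    """Combine the spans of all tasks that share an _input_hash.

    A sketch only – Prodigy's own helpers may handle duplicate
    or conflicting spans differently.
    """
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key in merged:
            # Same input text: append this task's spans to the merged task.
            merged[key]["spans"].extend(eg.get("spans", []))
        else:
            # First time we see this input: copy the task so we don't
            # mutate the original examples.
            entry = dict(eg)
            entry["spans"] = list(eg.get("spans", []))
            merged[key] = entry
    return list(merged.values())
```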

The NER model comes with a helper function merge_spans that should do exactly that – however, it’s currently only used internally during updating and we haven’t really tested its isolated usage yet. But you can try the following:

from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ['dataset_one', 'dataset_two', 'dataset_three']
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)  # get examples from the database

merged_examples = merge_spans(examples)

The merged_examples should now include the examples with spans merged if the input is identical. You can then inspect the list, save it out to a file, or even add it back to the database. To keep things clean, you probably also want to rehash the examples and assign new input and task hashes based on the new properties.

from prodigy import set_hashes

merged_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_examples(merged_examples, datasets=['merged_dataset'])

Instead of fetching the examples from the database in Python, you could also use the db-out command to download them as JSONL and then read them in. You can also use the ner.print-dataset recipe to preview the sets on the command line.

The easiest way would probably be to export your dataset as JSONL, and load it into a recipe like ner.make-gold.

prodigy ner.make-gold your_dataset en_core_web_sm exported_dataset.jsonl --label NEW_ENTITY

Since the recipe also supports reading in a stream from stdin, and db-out writes to stdout if no output file is specified, you should also be able to pipe the output forward, for extra convenience (untested, but should work):

prodigy db-out your_dataset | prodigy ner.make-gold your_new_dataset en_core_web_sm --label NEW_ENTITY

Important note: When using ner.teach, keep in mind that the selection of examples for each entity will be biased based on the model’s predictions. This is good, because it helps you annotate only the most relevant examples. But it also means that the examples selected for one entity type might not necessarily be the best examples to annotate for a different entity type. So I’d only recommend the above approach for non-active-learning recipes like ner.make-gold or ner.manual.

I’m not sure I understand your question correctly – could you give an example?

(Michael Higgins) #3

Thanks for all the help, I appreciate the care you put into your answers!

Is there a similarly simple way to filter out spans/texts that include a particular label? If I am training a model on a set of labels that is a subset of all the annotated labels, a KeyError is thrown. Is this the correct behavior?

For example, I have ZIP, STREET, STATE and PROD labels in my db. If I train on ZIP, STREET, STATE then I get the error: KeyError: 'U-PROD'

Never mind, I think I get it. All the annotated examples come from the text examples JSONL file, not the db. It is possible that an incoming text is already annotated (and in the db), but this will only happen if you are pulling examples from the same JSONL over multiple sessions, or if the model is still uncertain after annotating (in the same session). Right?

(Ines Montani) #4

This default behaviour isn't ideal, and we'll fix it for the next release. The labels will then always be read off the dataset, so you shouldn't see the KeyError anymore.

But you can always write your own filter functions – for example, something like this:

def filter_examples(examples, exclude=tuple()):
    for eg in examples:
        filtered_spans = [span for span in eg.get('spans', [])
                          if span['label'] not in exclude]
        if filtered_spans:  # only include example if there are spans left
            eg['spans'] = filtered_spans
            yield eg

examples = list(filter_examples(examples, exclude=('PROD', 'OTHER_LABEL')))
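To see what that filtering does end-to-end, here's a self-contained sanity check on a couple of dummy tasks (the texts and labels are made up for illustration):

```python
# Dummy annotation tasks, just to illustrate the label filter.
examples = [
    {"text": "123 Main St", "spans": [{"label": "STREET"}, {"label": "PROD"}]},
    {"text": "Widget 3000", "spans": [{"label": "PROD"}]},
]

exclude = ("PROD",)
filtered = []
for eg in examples:
    # Drop spans whose label is excluded.
    spans = [s for s in eg["spans"] if s["label"] not in exclude]
    if spans:  # keep the example only if any spans survive
        eg["spans"] = spans
        filtered.append(eg)

# Only the first example survives, with its PROD span dropped.
```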