Merging annotations from different datasets

As a newbie, I annotated three datasets for three different entities: street, zip and state. The datasets have shared text. I would now like to merge them so the model can learn from all of the labels. Is there an easy way to go about this?

Do ner.teach and ner.match provide samples from the db and from the input jsonl file?
How can I make it so that only samples from the db are given to annotate (with a new entity)?

Each example that comes in is assigned an _input_hash – for text-based tasks, that hash is generated from the "text" property. So you can easily check whether annotations refer to the same input text by comparing their input hashes, and then merging their "spans".
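
To make the idea concrete, here is a minimal sketch in plain Python that groups examples by their _input_hash and combines their spans. The function name merge_by_input_hash, the sample texts and the hash values are made up for illustration and are not part of Prodigy. The sketch also ignores any "answer" values, so treat it as a starting point only.

```python
import copy

def merge_by_input_hash(examples):
    """Combine the spans of all examples that share an _input_hash.

    Hypothetical helper for illustration, not a Prodigy function.
    """
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            # Copy the first example so the input list is left untouched.
            merged[key] = copy.deepcopy(eg)
            merged[key].setdefault("spans", [])
            continue
        target = merged[key]["spans"]
        seen = {(s["start"], s["end"], s["label"]) for s in target}
        for span in eg.get("spans", []):
            ident = (span["start"], span["end"], span["label"])
            if ident not in seen:  # skip exact duplicates
                target.append(dict(span))
                seen.add(ident)
    return list(merged.values())

# Made-up sample data: the first two examples share the same text and hash.
examples = [
    {"_input_hash": 1, "text": "12 Main St, Austin, TX 78701",
     "spans": [{"start": 0, "end": 10, "label": "STREET"}]},
    {"_input_hash": 1, "text": "12 Main St, Austin, TX 78701",
     "spans": [{"start": 23, "end": 28, "label": "ZIP"}]},
    {"_input_hash": 2, "text": "No address here.", "spans": []},
]
merged = merge_by_input_hash(examples)
```

The first merged example carries both the STREET and the ZIP span, while the example with a different hash is passed through unchanged.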

The NER model comes with a helper function merge_spans that should do exactly that – however, it's currently only used internally during updating and we haven't really tested its isolated usage yet. But you can try the following:

from prodigy.components.db import connect
from prodigy.models.ner import merge_spans

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ['dataset_one', 'dataset_two', 'dataset_three']
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)  # get examples from the database

merged_examples = merge_spans(examples)

The merged_examples should now include the examples with spans merged if the input is identical. You can then inspect the list, save it out to a file, or even add it back to the database. To keep things clean, you probably also want to rehash the examples and assign new input and task hashes based on the new properties.

from prodigy import set_hashes

# rehash so the input and task hashes reflect the merged spans
merged_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('merged_dataset')  # create the new dataset
db.add_examples(merged_examples, datasets=['merged_dataset'])

Instead of fetching the examples from the database in Python, you could also use the db-out command to download them as JSONL and then read them in. You can also use the ner.print-dataset recipe to preview the sets on the command line.
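
If you go the JSONL route, reading the exported file back in only needs the standard library. A minimal sketch, assuming you exported with something like prodigy db-out dataset_one > dataset_one.jsonl; the read_jsonl helper and the file name are illustrative, not part of Prodigy:

```python
import json

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)

# Usage (hypothetical file name):
# examples = list(read_jsonl("dataset_one.jsonl"))
```

The resulting list of dicts can then be used in the same way as the examples fetched with db.get_dataset.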

The easiest way would probably be to export your dataset as JSONL and load it into a recipe such as ner.make-gold (rather than ner.teach, for the reason given in the note below).

prodigy ner.make-gold your_dataset en_core_web_sm exported_dataset.jsonl --label NEW_ENTITY

Since the recipe also supports reading in a stream from stdin, and db-out writes to stdout if no output file is specified, you should also be able to pipe the output forward, for extra convenience (untested, but should work):

prodigy db-out your_dataset | prodigy ner.make-gold your_new_dataset en_core_web_sm --label NEW_ENTITY

Important note: When using ner.teach, keep in mind that the selection of examples for each entity will be biased based on the model's predictions. This is good, because it helps you annotate only the most relevant examples. But it also means that the examples selected for one entity type might not necessarily be the best examples to annotate for a different entity type. So I'd only recommend the above approach for non-active-learning recipes like ner.make-gold or ner.manual.

I'm not sure I understand your question correctly – could you give an example?

Thanks for all the help, I appreciate the care you put into your answers!

Is there a similarly simple way to filter out spans/texts that include a particular label? If I train a model on a set of labels that is a subset of all the annotated labels, a KeyError is thrown. Is this the correct behavior?

For example I have ZIP, STREET, STATE, PROD labels in my db ----- If I train on ZIP, STREET, STATE then I get the error: KeyError: 'U-PROD'

Never mind, I think I get it. All the annotated examples come from the text examples jsonl file, not the db. It is possible that an incoming text is already annotated (and in the db), but this will only happen if you are pulling examples from the same jsonl over multiple sessions, or if the model is still uncertain after annotating (in the same session). Right?

This default behaviour is not ideal, and we'll fix it for the next release. The labels will then always be read off the dataset, so you shouldn't see the KeyError anymore.

But you can always write your own filter functions – for example, something like this:

def filter_examples(examples, exclude=tuple()):
    for eg in examples:
        # keep only the spans whose label is not excluded
        filtered_spans = [span for span in eg.get('spans', [])
                          if span['label'] not in exclude]
        if filtered_spans:  # only include example if there are spans left
            eg['spans'] = filtered_spans
            yield eg  # yield (not return) so every example is processed

examples = list(filter_examples(examples, exclude=('PROD', 'OTHER_LABEL')))

Does the model from ner.batch-train learn from multiple annotations of the same span? After merge_spans, the data could look something like this:

{
    '_input_hash': blabla,
    '_task_hash': blabla,
    'text': 'I like London and Berlin.',
    'spans': [
        {'start': 7, 'end': 13, 'label': 'YOLO', 'answer': 'reject'},
        {'start': 7, 'end': 13, 'label': 'YOLO', 'answer': 'accept'},
    ],
}

If the answers do not agree, I was thinking it would be best to learn from both signals by not merging.

The model currently ignores examples with conflicting annotations, as there’s no way for us to guess which policy you’d prefer it to follow. If you have a lot of conflicting annotations, you should pre-process the dataset to resolve the conflicts, e.g. by trusting the latest annotation.
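
As an illustration of the "trust the latest annotation" policy, here is a minimal sketch. It assumes the data looks like the merged example above, with an "answer" on each span, and that the order of the spans list reflects the order in which the annotations were made. The keep_latest_spans helper is hypothetical and not part of Prodigy.

```python
def keep_latest_spans(example):
    """Keep only the last span seen for each (start, end, label).

    Assumes later entries in example["spans"] are newer annotations.
    """
    latest = {}
    for span in example.get("spans", []):
        # A later span with the same offsets and label replaces the earlier one.
        latest[(span["start"], span["end"], span["label"])] = span
    resolved = dict(example)  # shallow copy, the input dict is not modified
    resolved["spans"] = list(latest.values())
    return resolved

example = {
    "text": "I like London and Berlin.",
    "spans": [
        {"start": 7, "end": 13, "label": "YOLO", "answer": "reject"},
        {"start": 7, "end": 13, "label": "YOLO", "answer": "accept"},
    ],
}
resolved = keep_latest_spans(example)
```

Here the later "accept" replaces the earlier "reject" for the same span. If your datasets store timestamps or some other ordering, you would want to sort on that instead of relying on list order.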

Hi,

I have a naive question. I have trained a new model to recognise a new entity type. I did this because the existing models that spaCy provides recognise organisations, people and locations, which I also need for my application, but they do not recognise the plant entity, which can be recognised by a new model created with Prodigy.
So my question is: do I have to train brand new models for recognising people, locations and organisations, and merge the datasets? Is there no way to combine an existing model with a new customised model?

Vatsala

Hi,

Regarding the merging of annotations, can you provide some more details on how to do this? I am sorry, but the related thread is not very clear about it. For example, you mentioned using the ner.teach command, but the displayed code uses the ner.make-gold recipe. Also, I suppose mergex_examples is a typo? It should read merged_examples.

Also what do you mean by new entity in this line?
prodigy db-out your_dataset | prodigy ner.make-gold your_new_dataset en_core_web_sm --label NEW_ENTITY

Vatsala

Could you let me know what you mean by

connect to the DB using the prodigy.json settings

if I have not updated my Prodigy yet?

Also, if I do update Prodigy, how can I merge my three datasets, which contain the same texts with different labels?

Your prodigy.json lets you configure your database settings. By default, it uses an SQLite database in your local ~/.prodigy directory. You can find all the details and configuration options in your PRODIGY_README.html. If you use Prodigy’s db.connect helper, it will connect to the database using whichever database settings you have configured.

If you’re using Prodigy v1.8+, you can use the built-in db-merge command to merge datasets. You can find details in the documentation or in your README. When you train, the examples and annotated spans will be merged automatically.

My prodigy.json is an empty file.
Since I have not updated yet, I want to try this first:

db = connect(db = connect('sqlite', {'name': '?????'}))  # uses the settings in your prodigy.json
                                     
datasets = ['an_ner_date_01', 'an_ner_time_01']
merged_examples = []
for dataset in datasets:
    examples = db.get_dataset(dataset)
    merged_examples += examples

db.add_dataset('ner_merged')  # however you want to call the new set
db.add_examples(merged_examples, datasets=['ner_merged'])

I do not know what I should put inside db. I am reading PRODIGY_README.html
but still could not find the answer.

You should just be able to call db = connect(). This is usually all you need – unless you use a super custom database setup.

Dear Ines,

Now it is working. It seems that I could merge “corrected per-annotation text”.

Now, after merging, when I want to read the merged file with

python -m prodigy ner.manual an_ner_date_astr_01 en_core_web_sm AN_NER_DATE_ASTR_01.jsonl --label ASTR,DATE

it only shows the label DATE. It does not show me the label ASTR. However, when I open the file “merge_dataset”, I can see that the label “ASTR” is there as well.

I am working on that and would be happy to hear your thoughts as well.

Many thanks