data-to-spacy losing annotations

koaning · October 31, 2023, 2:12pm

Let me try something else then. First, I'll try to recreate your situation by annotating this data.

{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}

I'm doing these two recipe calls to generate two datasets. One with names, the other with greetings.

python -m prodigy ner.manual issue-6864-names blank:en examples.jsonl --label name
python -m prodigy ner.manual issue-6864-greeting blank:en examples.jsonl --label greeting

The interfaces look like this (screenshots not shown): one for names, one for greetings.

So that means that right now I have a dataset titled issue-6864-names and another titled issue-6864-greeting that share input hashes but have different labels attached. This is confirmed by db-out.

However, I noticed something interesting in the db-out calls. I'm listing the last item from both sets.

From `python -m prodigy db-out issue-6864-greeting`:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":0,"end":5,"token_start":0,"token_end":0,"label":"greeting"}],"answer":"accept","_timestamp":1698746894,"_annotator_id":"2023-10-31_11-08-03","_session_id":"2023-10-31_11-08-03"}

From `python -m prodigy db-out issue-6864-names`:

{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":17,"end":23,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1698746872,"_annotator_id":"2023-10-31_11-07-38","_session_id":"2023-10-31_11-07-38"}

Notice how the task hashes are the same too? That's why my "trick" didn't work. I would've expected them to differ, but I'll explore that later because I want to unblock you first.
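If you want to check this across a whole export rather than just the last item, something like the sketch below might help. It assumes you've loaded the db-out records from both datasets as lists of dicts; the helper names `span_signature` and `find_hash_collisions` are made up for this example and are not part of Prodigy.

```python
def span_signature(example):
    """Order-independent summary of an example's spans."""
    return sorted(
        (span["start"], span["end"], span["label"])
        for span in example.get("spans", [])
    )


def find_hash_collisions(examples_a, examples_b):
    """Return (_input_hash, _task_hash) pairs that appear in both lists
    but point at examples with different spans."""
    seen = {(ex["_input_hash"], ex["_task_hash"]): ex for ex in examples_a}
    collisions = []
    for ex in examples_b:
        key = (ex["_input_hash"], ex["_task_hash"])
        if key in seen and span_signature(seen[key]) != span_signature(ex):
            collisions.append(key)
    return collisions


# Trimmed-down copies of the two db-out records shown above.
names = [{
    "text": "hello my name is joseph",
    "_input_hash": 467632156,
    "_task_hash": -1782316920,
    "spans": [{"start": 17, "end": 23, "label": "name"}],
}]
greetings = [{
    "text": "hello my name is joseph",
    "_input_hash": 467632156,
    "_task_hash": -1782316920,
    "spans": [{"start": 0, "end": 5, "label": "greeting"}],
}]

collisions = find_hash_collisions(names, greetings)
print(collisions)  # [(467632156, -1782316920)]
```

Any pair that shows up here is an example where both hashes match even though the annotations differ, which is the situation that trips up hash-based merging.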

Might a script like this work instead? I'm thinking that we ignore the hashes for now and just manually make sure that each example has the appropriate spans attached.

from prodigy.components.db import connect 

# Fetch the examples from every dataset you want to combine. These are my
# two test datasets; swap in your own dataset names.
db = connect()
dataset_names = ["issue-6864-names", "issue-6864-greeting"]
old_examples = []
for name in dataset_names:
    old_examples.extend(db.get_dataset_examples(name))

def dedup_spans(example):
    """Keep one span per unique (start, end, label) combination."""
    key_values = {}
    for span in example.get('spans', []):
        key = (span['start'], span['end'], span['label'])
        # A later duplicate simply overwrites the earlier span with the same key.
        key_values[key] = span
    return list(key_values.values())

def merge(examples):
    """Combine the spans of all examples that share an input hash."""
    key_values = {}
    for ex in examples:
        if ex['_input_hash'] not in key_values:
            key_values[ex['_input_hash']] = ex
        else:
            merged_ex = key_values[ex['_input_hash']]
            merged_ex.setdefault('spans', []).extend(ex.get('spans', []))
    # Return one merged example per input hash, not the original duplicates.
    merged = list(key_values.values())
    for ex in merged:
        ex['spans'] = dedup_spans(ex)
    return merged

# These new examples should now have the spans merged and deduplicated.
new_examples = merge(old_examples)

Let me know! I'll gladly help you get unblocked if this doesn't work. On my end it seems that if I run this I do get examples that look like they contain all the spans that were annotated.

This is what the first example in new_examples looks like on my end.

{'text': 'hello my name is james', '_input_hash': -1294982232, '_task_hash': 465224705, '_is_binary': False, 'tokens': [{'text': 'hello', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'my', 'start': 6, 'end': 8, 'id': 1, 'ws': True}, {'text': 'name', 'start': 9, 'end': 13, 'id': 2, 'ws': True}, {'text': 'is', 'start': 14, 'end': 16, 'id': 3, 'ws': True}, {'text': 'james', 'start': 17, 'end': 22, 'id': 4, 'ws': False}], '_view_id': 'ner_manual', 'spans': [{'start': 17, 'end': 22, 'token_start': 4, 'token_end': 4, 'label': 'name'}, {'start': 0, 'end': 5, 'token_start': 0, 'token_end': 0, 'label': 'greeting'}], 'answer': 'accept', '_timestamp': 1698746861, '_annotator_id': '2023-10-31_11-07-38', '_session_id': '2023-10-31_11-07-38'}
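To carry on from here, one option is to check that every merged example carries both labels and then write the result to a JSONL file you can load into a fresh dataset. This is only a sketch: the `new_examples` list below is a trimmed stand-in for the real merged list, and `write_jsonl`, `labels_per_example` and the file name `merged.jsonl` are names I made up for the example.

```python
import json
from pathlib import Path


def write_jsonl(examples, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")


def labels_per_example(examples):
    """Map each example's text to the set of span labels it carries."""
    return {
        ex["text"]: {span["label"] for span in ex.get("spans", [])}
        for ex in examples
    }


# Stand-in for the merged list from the script above, trimmed to the relevant keys.
new_examples = [{
    "text": "hello my name is james",
    "_input_hash": -1294982232,
    "_task_hash": 465224705,
    "spans": [
        {"start": 17, "end": 22, "token_start": 4, "token_end": 4, "label": "name"},
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "greeting"},
    ],
    "answer": "accept",
}]

# Collect any example that does not have exactly the two expected labels.
expected = {"name", "greeting"}
missing = {
    text: labels
    for text, labels in labels_per_example(new_examples).items()
    if labels != expected
}
print(missing)  # {}

write_jsonl(new_examples, "merged.jsonl")
```

If `missing` comes back empty, you could load the file with `python -m prodigy db-in issue-6864-merged merged.jsonl` (that dataset name is just a placeholder) and point data-to-spacy at the new dataset instead of the db-merge one.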
