Let me try something else then. First, I'll try to recreate your situation by annotating this data.
{"text": "hello my name is james"}
{"text": "hello my name is john"}
{"text": "hello my name is robert"}
{"text": "hello my name is michael"}
{"text": "hello my name is william"}
{"text": "hello my name is mary"}
{"text": "hello my name is david"}
{"text": "hello my name is richard"}
{"text": "hello my name is joseph"}
I'm running these two recipe calls to generate two datasets: one with names, the other with greetings.
python -m prodigy ner.manual issue-6864-names blank:en examples.jsonl --label name
python -m prodigy ner.manual issue-6864-greeting blank:en examples.jsonl --label greeting
The interfaces look like this, with one annotation view for the name label and another for the greeting label.
So that means that right now I have a dataset titled issue-6864-names and another titled issue-6864-greeting. They share input hashes but still have different labels attached, which is confirmed by db-out.
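If you'd like to verify that overlap on your end without eyeballing the output, here's a quick sketch. It assumes your datasets are named as above and that your Prodigy version has db.get_dataset_examples, which the script further down also uses.

from prodigy.components.db import connect

db = connect()

# Collect the input hashes stored in each dataset.
name_hashes = {eg["_input_hash"] for eg in db.get_dataset_examples("issue-6864-names")}
greeting_hashes = {eg["_input_hash"] for eg in db.get_dataset_examples("issue-6864-greeting")}

# If this prints True, both datasets cover exactly the same input texts.
print(name_hashes == greeting_hashes)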
However, I noticed something interesting in the db-out calls. I'm listing the last item from both sets.
From python -m prodigy db-out issue-6864-greeting:
{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":0,"end":5,"token_start":0,"token_end":0,"label":"greeting"}],"answer":"accept","_timestamp":1698746894,"_annotator_id":"2023-10-31_11-08-03","_session_id":"2023-10-31_11-08-03"}
From python -m prodigy db-out issue-6864-names:
{"text":"hello my name is joseph","_input_hash":467632156,"_task_hash":-1782316920,"_is_binary":false,"tokens":[{"text":"hello","start":0,"end":5,"id":0,"ws":true},{"text":"my","start":6,"end":8,"id":1,"ws":true},{"text":"name","start":9,"end":13,"id":2,"ws":true},{"text":"is","start":14,"end":16,"id":3,"ws":true},{"text":"joseph","start":17,"end":23,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":17,"end":23,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1698746872,"_annotator_id":"2023-10-31_11-07-38","_session_id":"2023-10-31_11-07-38"}
Notice how the task hashes are the same too? That's why my "trick" didn't work. I would've expected them to differ, but I'll explore that later because I want to unblock you first.
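(If you ever want to poke at that yourself: I suspect the hashes were set when the tasks were created, before any spans existed, so recomputing them after annotation should make them diverge. A rough sketch, assuming set_hashes in your Prodigy version accepts task_keys and overwrite keyword arguments; worth double-checking against the docs.)

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
examples = list(db.get_dataset_examples("issue-6864-names"))
examples += db.get_dataset_examples("issue-6864-greeting")

# Recompute the hashes so the annotated spans feed into the task hash.
# overwrite=True replaces the hashes that were stored at task-creation time.
rehashed = [set_hashes(eg, task_keys=("spans", "label"), overwrite=True)
            for eg in examples]

# The two copies of "hello my name is joseph" should now end up with
# different task hashes, because their spans differ.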
Might a script like this work instead? I'm thinking that we ignore the hashes for now and just manually make sure that each example has the appropriate spans attached.
from prodigy.components.db import connect

# Fetch the examples from both datasets you annotated.
db = connect()
dataset_names = ["issue-6864-names", "issue-6864-greeting"]
old_examples = []
for name in dataset_names:
    old_examples.extend(db.get_dataset_examples(name))


def dedup_spans(example):
    """Drop duplicate spans, keyed on (start, end, label)."""
    key_values = {}
    for span in example.get("spans", []):
        key = (span["start"], span["end"], span["label"])
        key_values[key] = span
    return list(key_values.values())


def merge(examples):
    """Merge the spans of examples that share an input hash. We need this
    extra function because dictionaries aren't hashable."""
    key_values = {}
    for ex in examples:
        if ex["_input_hash"] not in key_values:
            key_values[ex["_input_hash"]] = ex
        else:
            key_values[ex["_input_hash"]]["spans"].extend(ex["spans"])
    # Keep one merged copy per input, then deduplicate its spans.
    merged = list(key_values.values())
    for ex in merged:
        ex["spans"] = dedup_spans(ex)
    return merged


# These new examples should now have the spans merged and deduplicated.
new_examples = merge(old_examples)
Let me know! I'll gladly help you get unblocked if this doesn't work. When I run this on my end, I get examples that contain all the spans that were annotated. Here's what the merged example for "hello my name is james" looks like for me.
{'text': 'hello my name is james', '_input_hash': -1294982232, '_task_hash': 465224705, '_is_binary': False, 'tokens': [{'text': 'hello', 'start': 0, 'end': 5, 'id': 0, 'ws': True}, {'text': 'my', 'start': 6, 'end': 8, 'id': 1, 'ws': True}, {'text': 'name', 'start': 9, 'end': 13, 'id': 2, 'ws': True}, {'text': 'is', 'start': 14, 'end': 16, 'id': 3, 'ws': True}, {'text': 'james', 'start': 17, 'end': 22, 'id': 4, 'ws': False}], '_view_id': 'ner_manual', 'spans': [{'start': 17, 'end': 22, 'token_start': 4, 'token_end': 4, 'label': 'name'}, {'start': 0, 'end': 5, 'token_start': 0, 'token_end': 0, 'label': 'greeting'}], 'answer': 'accept', '_timestamp': 1698746861, '_annotator_id': '2023-10-31_11-07-38', '_session_id': '2023-10-31_11-07-38'}
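From there, if the merged examples look right, you could save them into a fresh dataset so that training commands only need to point at one place. This continues from the script above (db and new_examples come from there), and issue-6864-merged is just a placeholder name.

# Continuing from the script above: db is the connected database and
# new_examples holds the merged annotations.
merged_name = "issue-6864-merged"   # placeholder, pick any name you like
if merged_name not in db.datasets:
    db.add_dataset(merged_name)
db.add_examples(new_examples, datasets=[merged_name])

After that, commands like data-to-spacy or train can read from that single dataset.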