Duplicate entity annotations

ner

(John) #1

I found some doubly annotated spans in our data from Prodigy.

{'text': 'Play alternative rock.',
 '_input_hash': 415108119,
 '_task_hash': -1690765751,
 'tokens': [{'text': 'Play', 'start': 0, 'end': 4, 'id': 0},
  {'text': 'alternative', 'start': 5, 'end': 16, 'id': 1},
  {'text': 'rock', 'start': 17, 'end': 21, 'id': 2},
  {'text': '.', 'start': 21, 'end': 22, 'id': 3}],
 'spans': [{'start': 5,
   'end': 16,
   'token_start': 1,
   'token_end': 1,
   'label': 'genre',
   'answer': 'accept'},
  {'start': 5,
   'end': 21,
   'token_start': 1,
   'token_end': 2,
   'label': 'genre',
   'answer': 'accept'}],
 'answer': 'accept'}

We only use ner.manual and ner.make-gold. Do you have any idea how this could have happened? I thought there were mechanisms to not bring duplicates up for re-annotation. Can you also tell us what Prodigy would make of such data when training? I know spaCy throws an error when you insert overlapping spans…

Sincere thanks!


(Ines Montani) #2

Hi! Thanks for the report – could you share more details on how the data was created? Did you run ner.manual and ner.make-gold over the same data in the same dataset? Did you have multiple people annotating and if so, how did you set this up?

And where does the particular example you posted come from? Was this in the dataset, or in the merged training data? It's definitely confusing that there are two conflicting spans in the same task.

Yes, by default, this is done by comparing the _task_hash, which represents the input (like the raw text), plus any existing annotations (highlighted spans you’re collecting feedback on, labels etc.). So basically, by default, Prodigy allows different questions about the same input but not the exact same question about the same input. This is important, because if you’re using recipes like ner.teach that suggest single spans, you do want to be asked about more than one option per example.
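To make the distinction concrete, here's a rough sketch of the idea in plain Python. This is an illustration only – the function names are hypothetical, and Prodigy's real hashes are computed differently (so the values won't match Prodigy's) – but it shows why the same text can produce the same input hash and different task hashes:

```python
import hashlib
import json

def _hash(obj):
    # Stable hash over a JSON-serializable object (illustration only --
    # Prodigy's actual hashing differs, so values won't match Prodigy's).
    data = json.dumps(obj, sort_keys=True).encode("utf8")
    return int(hashlib.md5(data).hexdigest()[:8], 16)

def get_hashes(task):
    # Input hash: based on the raw input only (e.g. the text).
    input_hash = _hash({"text": task.get("text")})
    # Task hash: input plus any pre-set annotations, so the same text
    # with different suggested spans becomes a *different* question.
    task_hash = _hash({"input": input_hash,
                       "spans": task.get("spans", []),
                       "label": task.get("label")})
    return input_hash, task_hash

raw = {"text": "Play alternative rock."}
pre = {"text": "Play alternative rock.",
       "spans": [{"start": 5, "end": 16, "label": "genre"}]}
i1, t1 = get_hashes(raw)
i2, t2 = get_hashes(pre)
print(i1 == i2)  # True  -- same input
print(t1 == t2)  # False -- different question about that input
```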

But this also means that a raw task in ner.manual and a pre-annotated task in ner.make-gold would receive different task hashes for the same input. So this is one potential explanation for what happened here.

During training, it’ll just ignore them and treat the token tags as unknown values, because there’s no way to represent the spans with token-based BILUO tags. So examples like this shouldn’t have a negative impact on the model – it just won’t learn anything from them.
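As a sketch of why overlapping spans can't be represented: BILUO assigns exactly one tag per token, so two spans claiming the same token have no valid encoding. The function below is an illustration (not Prodigy's or spaCy's actual code) that falls back to the "unknown" tag `-` on conflict:

```python
def spans_to_biluo(tokens, spans):
    # One BILUO tag per token; overlapping spans have no valid
    # encoding, so conflicting tokens fall back to "-" (unknown).
    tags = ["O"] * len(tokens)
    for span in spans:
        idxs = list(range(span["token_start"], span["token_end"] + 1))
        if any(tags[i] != "O" for i in idxs):
            for i in idxs:
                tags[i] = "-"  # conflict: nothing to learn here
            continue
        if len(idxs) == 1:
            tags[idxs[0]] = f"U-{span['label']}"
        else:
            tags[idxs[0]] = f"B-{span['label']}"
            for i in idxs[1:-1]:
                tags[i] = f"I-{span['label']}"
            tags[idxs[-1]] = f"L-{span['label']}"
    return tags

tokens = ["Play", "alternative", "rock", "."]
spans = [{"token_start": 1, "token_end": 1, "label": "genre"},
         {"token_start": 1, "token_end": 2, "label": "genre"}]
print(spans_to_biluo(tokens, spans))  # ['O', '-', '-', 'O']
```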

If the example you posted came from the merged training data, this would mean that the internal helper Prodigy uses to merge spans doesn’t actually do anything if spans overlap and just accepts them, because it’s clear that they’ll be ignored during training anyways. I’m not 100% sure what the best solution would be in that case – it’d probably be good to have an option to raise an explicit error instead, or maybe a separate command to validate the data and output warnings for conflicting or otherwise problematic annotations (kinda like the experimental debug-data in spacy-nightly).
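In the meantime, a check like this is easy to run yourself over a db-out export. This is a hypothetical sketch of what such a validation step could flag, not an existing Prodigy command – it just reports examples whose spans overlap in character offsets:

```python
def find_conflicting_spans(examples):
    # Flag examples whose spans overlap in character offsets --
    # a sketch of one check a validation command could perform.
    problems = []
    for eg in examples:
        spans = sorted(eg.get("spans", []), key=lambda s: (s["start"], s["end"]))
        for prev, curr in zip(spans, spans[1:]):
            if curr["start"] < prev["end"]:
                problems.append((eg["text"], prev, curr))
    return problems

examples = [{"text": "Play alternative rock.",
             "spans": [{"start": 5, "end": 16, "label": "genre"},
                       {"start": 5, "end": 21, "label": "genre"}]}]
for text, a, b in find_conflicting_spans(examples):
    print(f"Conflict in {text!r}: {a} overlaps {b}")
```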


(John) #3

I work for a medium-sized startup, so many people are involved with using our data.

So our procedure is to train an initial model with ner.manual, an initial text file, and an NER dataset. Once a basic model is available, we annotate with a second text file and the initial model in a make-gold session with the same dataset. I see that this query was included once in each of those text files, so perhaps that is why the hash of the first input example (with its annotation) did not match during the make-gold session with the second file?

I ran an experiment which involved annotating the same query twice: once with ner.manual and once with ner.make-gold, each with different tags. This resulted in two JSON entries in the dataset:

prodigy db-out duplicate_test
{"text":"Play alternative music.","_input_hash":-1439002215,"_task_hash":1687991231,"tokens":[{"text":"Play","start":0,"end":4,"id":0},{"text":"alternative","start":5,"end":16,"id":1},{"text":"music","start":17,"end":22,"id":2},{"text":".","start":22,"end":23,"id":3}],"spans":[{"start":5,"end":16,"token_start":1,"token_end":1,"label":"genre"}],"answer":"accept"}
{"text":"Play alternative music.","_input_hash":-1439002215,"_task_hash":-840412699,"tokens":[{"text":"Play","start":0,"end":4,"id":0},{"text":"alternative","start":5,"end":16,"id":1},{"text":"music","start":17,"end":22,"id":2},{"text":".","start":22,"end":23,"id":3}],"spans":[{"start":5,"end":22,"token_start":1,"token_end":2,"label":"genre"}],"answer":"accept"}

So it seems that's not what happened in our case, since these ended up as two separate entries. There were also some changes to the entity labels at one point; I tried replicating a label change as well, but those also ended up as two entries, with the exact same input and task hashes as above…

Let me know if you have more thoughts about what the cause could be.

Data validation sounds like a very useful tool.

Thanks again!


(Ines Montani) #4

Thanks for the details! And it sounds like you have good processes in place already.

If you annotate the same example twice – once with ner.manual and once with ner.make-gold – it's definitely possible that it shows up and gets annotated again. The example you posted looks consistent with Prodigy's hashing: the _input_hash is identical, meaning that the annotations are on the same text, but the _task_hash is different, meaning that the initial suggested annotations were different (e.g. none for ner.manual and some in ner.make-gold). And Prodigy will compare on the task hash, not the input.

If you know that you only want to see an input once, you can also add your own filter functions to filter on input hashes instead (see the README for the helper function for that). Just make sure you’re only doing this for specific, manual recipes – if you added something like that to ner.teach, it’d make the recipe very ineffective.
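Purely as an illustration of what filtering on input hashes means (the function below is a hypothetical stand-in, not Prodigy's actual helper – see the README for that), the idea is a generator that drops any incoming task whose `_input_hash` has already been seen:

```python
def filter_seen_inputs(stream, seen_input_hashes):
    # Skip tasks whose input hash is already in the dataset --
    # i.e. filter on _input_hash rather than _task_hash.
    for task in stream:
        if task["_input_hash"] not in seen_input_hashes:
            yield task

stream = [{"text": "Play alternative music.", "_input_hash": -1439002215},
          {"text": "Play jazz.", "_input_hash": 123456789}]
seen = {-1439002215}
print([t["text"] for t in filter_seen_inputs(stream, seen)])  # ['Play jazz.']
```

As noted above, you'd only want something like this in fully manual workflows, where one pass over each input is enough.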

Just to make double-sure I understand this correctly and can investigate: The example you shared in your first post with the inconsistent spans in the same task, did that come straight from a dataset in db-out? Or did that come from the final training data, e.g. what’s imported after training as training.jsonl?


(John) #5

The first and second posts contain what came out of db-out.

Thanks for the explanation of the hashes. With that explanation, I can see why two people running sessions is a prime suspect. Let me know if I can be of further use.