Task hash of ner.make-gold and ner.silver-to-gold should be the same?

I have a datasource that I initially did some ner.match and ner.teach on, and then used ner.silver-to-gold to create gold data. Now I've added some data to the datasource and wanted to go directly to gold using ner.make-gold, but realised that Prodigy started suggesting tasks I had already annotated. This must be because the task hashes produced by ner.silver-to-gold are different from those produced by ner.make-gold... but since the end product is identical (gold annotated examples), wouldn't it be better to have the same task hash so that we don't end up duplicating work?

Yes, this sounds reasonable – I'll take a look! The solution might be as simple as forcing a rehash of the incoming examples. For example:

examples = [prodigy.set_hashes(eg, overwrite=True) for eg in examples]

Cool, thanks! Just to confirm my understanding: would the task hash be different due to different 'options' values between ner.silver-to-gold and ner.make-gold? (I'm referring to the set_hashes documentation.)
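For reference, here's how I currently read the hashing logic from the docs. This is only a sketch using the documented default keyword values, not something I've verified against the two recipes:

from prodigy import set_hashes

eg = {"text": "hello world"}
# the input hash is computed from input_keys, the task hash from task_keys –
# note that "options" is one of the default task keys
eg = set_hashes(
    eg,
    input_keys=("text", "image", "html", "input"),
    task_keys=("spans", "label", "options"),
    overwrite=True,
)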

So, the task hashes of the ner.silver-to-gold annotations are obviously already determined... you're saying I'd have to dump the already annotated data, create a recipe that reads it back in and rehashes it there, and then use the same rehashing in an extension of ner.make-gold?

Ah, I just realised I might have misread your initial question: The task hash reflects the input text plus the annotations (highlighted entity, label). For instance, if your text is "hello world" and one example suggests "hello" as an entity, and the other "world", those examples will have different task hashes. Their input hashes will be the same, though.
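A quick sketch to illustrate this, using set_hashes with its default keys:

from prodigy import set_hashes

eg1 = set_hashes({"text": "hello world",
                  "spans": [{"start": 0, "end": 5, "label": "ENTITY"}]})
eg2 = set_hashes({"text": "hello world",
                  "spans": [{"start": 6, "end": 11, "label": "ENTITY"}]})
assert eg1["_input_hash"] == eg2["_input_hash"]  # same text
assert eg1["_task_hash"] != eg2["_task_hash"]    # different highlighted entity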

So in your case, the input hash would be a better criterion to use: if a text is already in the data, don't ask about it again, even if the annotations on it are different (e.g. a suggestion proposes entities that aren't 100% correct, while the correct example is already in the set).

The silver-to-gold workflow currently assumes that all you have is "silver" data and that you don't already have partial gold-standard annotations. But you could edit the recipe slightly and add a filter function that checks if an example's input hash is already in the dataset and only sends it out if it's not:

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # connect to the database using your prodigy.json settings
# collect the input hashes of everything already annotated in the dataset
input_hashes = set(db.get_input_hashes("your_dataset"))

def filter_stream(stream):
    for eg in stream:
        eg = set_hashes(eg, overwrite=True)
        # only send out examples whose text isn't in the dataset yet
        if eg["_input_hash"] not in input_hashes:
            yield eg
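In the recipe itself, you'd then wrap the existing stream with this filter before returning it – something along these lines, where stream stands for whatever the recipe already produces:

stream = filter_stream(stream)  # only keep texts not yet in the dataset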

Thanks @ines, that makes sense!