I have a datasource that I initially did some ner.match and ner.teach on, and then used ner.silver-to-gold to create gold data. Now I've added some data to the datasource and wanted to go directly to gold using ner.make-gold, but realised that Prodigy then started suggesting already annotated tasks. This must be because the task hashes produced by ner.silver-to-gold are different from those produced by ner.make-gold... but since the end product is identical (gold annotated examples), wouldn't it be better to have the same task hash so that we don't end up duplicating work?
Yes, this sounds reasonable – I'll take a look! The solution might be as simple as forcing a rehashing on the incoming examples. For example:
examples = [prodigy.set_hashes(eg, overwrite=True) for eg in examples]
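If the already annotated examples are stored in a dataset, a minimal sketch of loading and rehashing them could look like this (the dataset name is just a placeholder):

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # connect to the database using your Prodigy settings
examples = db.get_dataset("your_dataset")  # previously saved annotations
examples = [set_hashes(eg, overwrite=True) for eg in examples]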
Cool, thanks! Just to confirm my understanding, would it be that the task hash is different due to different 'options' parameters between ner.silver-to-gold and ner.make-gold? (referring to the set_hashes documentation)
So, the task hashes of the ner.silver-to-gold annotations are obviously already determined... you're saying I'd have to dump the already annotated data, create a recipe that reads it back in and rehashes it there? And then use the same rehashing in an extension of ner.make-gold?
Ah, I just realised I might have misread your initial question: The task hash reflects the input text plus the annotations (highlighted entity, label). For instance, if your text is "hello world" and one example suggests "hello" as an entity, and the other "world", those examples will have different task hashes. Their input hashes will be the same, though.
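For example, here's a quick sketch of that distinction (the texts and labels are made up):

from prodigy import set_hashes

# same input text, but different suggested entities
eg1 = set_hashes({"text": "hello world", "spans": [{"start": 0, "end": 5, "label": "GREETING"}]})
eg2 = set_hashes({"text": "hello world", "spans": [{"start": 6, "end": 11, "label": "GREETING"}]})

assert eg1["_input_hash"] == eg2["_input_hash"]  # input text is identical
assert eg1["_task_hash"] != eg2["_task_hash"]    # annotations differ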
So in your case, the input hash would be the better criterion: if a text is already in the dataset, don't ask about it again, even if the annotations on it might differ (for example, a new suggestion proposes entities but isn't 100% correct, while the correct example is already in the set).
The silver-to-gold workflow currently assumes that all you have is "silver" data and that you don't already have partial gold-standard annotations. But you could edit the recipe slightly and add a filter function that checks if an example's input hash is already in the dataset and only sends it out if it's not:
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # connect to the database using your Prodigy settings
input_hashes = db.get_input_hashes("your_dataset")

def filter_stream(stream):
    for eg in stream:
        eg = set_hashes(eg, overwrite=True)
        # only send out examples whose text isn't in the dataset yet
        if eg["_input_hash"] not in input_hashes:
            yield eg
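In the recipe, you'd then wrap the stream before returning it, e.g. stream = filter_stream(stream).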
Thanks @ines, that makes sense!