Hello! Sorry for bombarding you with questions - I just really dug into Prodigy details recently, having been a relatively uneducated user until now. I am creating a custom review recipe and trying to follow the existing review recipe logic. I am missing the piece where the recipe (or stream?) decides whether the task has already been reviewed and exists in the target dataset. Could you please help? I know how to do it (check the hash in the target dataset), but I want to learn best practices - where it normally happens, and whether there's already some logic for it anywhere. Thanks!
Hi @nrodnova and no worries! That's what the forum is for!
The mechanism responsible for excluding already answered questions is, precisely, filtering on the `task_hash`.
It is not done at the recipe level, though.
When the `Controller` is instantiated, it adds the current dataset to the set of datasets that should be used for the filtering (this default behavior can be modified by setting `auto_exclude_current` to `false` in `prodigy.json`).
Then, whenever a session asks for a new batch of questions, the filtering is applied. The actual filtering happens on the `Session` object, because it also needs to take into account the questions answered by that session in the current process. What exactly gets filtered out also depends on the `feed_overlap` setting: if the overlap is set to `true`, only the hashes from the current session are excluded; otherwise, the hashes from all sessions are excluded.
In theory, if you are replicating the hashing behavior of the original `review` recipe, you shouldn't need to implement the exclusion of already reviewed questions yourself.
Just for completeness: whether to filter on the task or the input hash can also be configured via the `exclude_by` config setting. Please see here for more details: Loaders and Input Data · Prodigy · An annotation tool for AI, Machine Learning & NLP
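For reference, the three settings mentioned above all live in `prodigy.json`; a sketch of how they might be combined (the values shown are, as far as I know, the defaults):

```json
{
  "auto_exclude_current": true,
  "feed_overlap": false,
  "exclude_by": "task"
}
```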
Hi @magdaaniol, thank you very much for your response! It makes sense.
One more question about filtering on `task_hash`. Let's assume we are talking about the `ner.manual` interface. If `task_hash` is calculated on `spans` and `answers` only (we don't have `options` in NER), is it really reliable to filter tasks only by span boundaries (assuming answers are mostly `"accept"`)? For short texts and large datasets it's quite likely to get the same `task_hash` for different `input_hash` values, no? What is the reasoning for not including the original `text` (or anything else counted into the `input_hash`)?
Hi @nrodnova,
The `task_hash` function does take into account the `input_hash`. You might have missed this detail in the docs:

> Task keys are based on the input hash, plus optional features you’re annotating, like the label or spans.
A very good observation nonetheless!
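To make that concrete, here's a quick check (with made-up texts) showing that two tasks with identical spans but different texts still end up with different task hashes:

```python
from prodigy import set_hashes

# Same span boundaries and label, different underlying text
a = set_hashes({"text": "Acme hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]})
b = set_hashes({"text": "Ajax hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]})

assert a["_input_hash"] != b["_input_hash"]
assert a["_task_hash"] != b["_task_hash"]  # differs even though the spans are identical
```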
Good point!
I was looking at the description of the `task_keys` parameter of the `set_hashes` function:

> Dictionary keys to consider for the task hash. Defaults to `("spans", "label", "options")`.

So, are `input_keys` + `task_keys` always passed together for the task hash calculation?
Also, one more hashing question. The way I often do `ner` and `spans` annotation is pre-populating the `spans` key with potential candidates (or sentences to look at), and then annotating with a different label (in `ner` mode I'd have to delete the highlighted value and re-select it with the appropriate label; in `spans` it's easier). Because I pass `spans` content in the input, the `_task_hash` is calculated, and it seems to not be recalculated before going to the database. So I am wondering if in your code `overwrite` is `False` and I need to do it in the `before_db` callback myself, or if I am missing something.
Hi @nrodnova,
Yes, to be precise, it passes the already computed `input_hash` as a string. This `input_hash` string is then used as a prefix to the string concatenated from the values of the `task_hash` keys, and the result is passed to the `murmurhash` function.
The core of the function is:
```python
# Simplified version of Prodigy's internal hashing function
from typing import Dict, Iterable

import murmurhash
import srsly

def get_hash(task: Dict, keys: Iterable[str], prefix: str = "") -> int:
    # Serialize each relevant key deterministically (sorted keys), so the
    # same content always produces the same hash
    values = [
        f"{key}={srsly.json_dumps(task[key], sort_keys=True)}"
        for key in keys
        if key in task
    ]
    joined = str(prefix) + "".join(values)  # prefix would be the input_hash
    return murmurhash.hash(joined)
```
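For illustration, calling the sketch above with the default task keys (the example task is made up, and the input keys are simplified to just `text` here):

```python
task = {"text": "Acme hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]}
input_hash = get_hash(task, keys=("text",))
task_hash = get_hash(task, keys=("spans", "label", "options"), prefix=str(input_hash))
```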
About the default hashing in the recipes:
By default, the recipe rehashes the input stream (this is done inside the `get_stream` function with `rehash` set to `True`), and it does overwrite the hashes if they already exist.
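In a custom recipe, that would look roughly like this (the exact import path can differ between Prodigy versions):

```python
# Hedged sketch: loading a source with rehashing, as a custom recipe might
from prodigy.components.stream import get_stream

stream = get_stream("./examples.jsonl", rehash=True)  # overwrites any existing hashes
```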
And yes, it doesn't get rehashed before saving to the DB. This is mostly because the hashing logic is relevant for processing the input: when your annotations are used again in any Prodigy recipe, they will be rehashed by default upon reading, so the NER spans will take effect then.
If you're using standard key names, don't need any custom filtering, and will keep using these datasets within Prodigy, there's no need to rehash before writing to the DB.
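That said, if you do want the stored hashes to reflect the final annotations, a minimal sketch of a `before_db` callback using the documented `set_hashes` helper would be:

```python
from typing import Dict, List
from prodigy import set_hashes

def before_db(examples: List[Dict]) -> List[Dict]:
    # Recompute both hashes from the final (possibly edited) spans, so the
    # stored _task_hash reflects the annotation, not the pre-populated input
    return [set_hashes(eg, overwrite=True) for eg in examples]
```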
Thank you! Good to know. I was experimenting with the review recipe, and things didn't work for me because I didn't recalculate the hashes.