Prodigy error when reviewing audio annotation coupled with videos

Hello,
We are trying to review some audio annotations done by labelers. After base64-encoding the data into our JSONL file, we ended up with a 5 GB file for what should be ~80 videos.

Running this locally with
cat ~/audio_b64.jsonl | prodigy audio.manual rev_audio - --loader jsonl --label person
couldn't load our file for reviewing the annotations, and printed this in our terminal:

⚠ Warning: filtered 99% of entries because they were duplicates. Only 1
items were shown out of 77. You may want to deduplicate your dataset ahead of
time to get a better understanding of your dataset size.

Any idea what might be the cause of that? What can be an alternative?
Thank you.
George.

Hi and sorry for only getting to this now, I must have missed the thread!

How does your JSONL data look under the hood? It sounds like the hashes Prodigy auto-generated for it aren't taking the actual video data into account, so all records end up with the same hashes. When generating the JSONL, you can add an _input_hash and _task_hash value (e.g. based on the file name) to help Prodigy distinguish between examples that are identical, different questions about the same data, and entirely different inputs/questions.
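As a rough illustration of the idea above, here is a minimal sketch that derives both hash values from the file name while the JSONL is being generated. The helper names (`stable_hash`, `add_hashes`), the example file names and the use of SHA-1 are assumptions made for this sketch, not Prodigy's own hashing; it also assumes any stable integer is acceptable for the two hash fields.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Map a string to a stable signed 32-bit integer.

    Illustrative only: this is not the hashing Prodigy uses internally.
    """
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


def add_hashes(record: dict, file_name: str, label: str) -> dict:
    """Return a copy of the record with hashes based on the file name."""
    out = dict(record)
    # Same input (file) -> same _input_hash.
    out["_input_hash"] = stable_hash(file_name)
    # Same input AND same question (label) -> same _task_hash.
    out["_task_hash"] = stable_hash(f"{file_name}|{label}")
    return out


# Hypothetical file names, for illustration only.
first = add_hashes({"meta": {"file": "clip_001.mp4"}}, "clip_001.mp4", "person")
second = add_hashes({"meta": {"file": "clip_002.mp4"}}, "clip_002.mp4", "person")
```

With this scheme, two records only share an _input_hash if they come from the same file, and only share a _task_hash if they also ask about the same label.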

Btw, if your data is this large, you might also want to consider a different loading strategy so you don't end up with these huge JSONL files and base64 strings that are sent back and forth. One option could be to use an S3 bucket (or similar) and only have your JSONL contain the URLs.
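A URL-based JSONL file could be produced along these lines. This is only a sketch: the bucket URL, the file names and the helper names (`url_record`, `to_jsonl`) are made up for illustration, and it assumes the recipe reads the media location from a "video" key.

```python
import json

# Hypothetical base URL; replace with wherever the files are actually hosted.
BASE_URL = "https://example-bucket.s3.amazonaws.com/videos"


def url_record(file_name: str, base_url: str = BASE_URL) -> dict:
    """Build one task that points at a hosted file instead of embedding base64."""
    return {
        "video": f"{base_url}/{file_name}",
        "meta": {"file": file_name},
    }


def to_jsonl(file_names) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(url_record(name)) for name in file_names)


jsonl_text = to_jsonl(["clip_001.mp4", "clip_002.mp4"])
```

Each line then carries a short URL string rather than the full encoded media, so the file size no longer grows with the size of the videos.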

Hello Ines,
Yes, it turns out that under the hood all the records in my JSONL file have the same _input_hash; only the _task_hash values differ.
I tried dropping the _input_hash, and also changing the number by one or two digits so that it differs a bit. In both cases it still raised the same warning. Any ideas on what to do next?

I switched to reading from the URL directly, thank you for that. I had also thought it might be a problem with the base64 encoding.

Thank you.

Okay, so you mean you still saw the warning about 99% of entries being filtered? I just double-checked and the audio recipe actually re-hashes the stream by default, and for some reason, it doesn't seem to take the "video" key into account.

In the meantime, can you try assigning a unique "text" value to each entry? This should do the trick and convince Prodigy that the records are all different.
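One way to apply this workaround is a small preprocessing pass over the JSONL, sketched below. The function name `add_unique_text` and the "clip-N" naming scheme are assumptions for illustration; any value that differs per record should serve the same purpose.

```python
import json


def add_unique_text(lines):
    """Yield JSONL lines with a distinct "text" value added to each record.

    Works line by line, so a very large file never has to fit in memory.
    """
    index = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Illustrative naming scheme: a running counter per record.
        record["text"] = f"clip-{index}"
        index += 1
        yield json.dumps(record)


# Two records that would otherwise look identical.
sample = ['{"video": "same"}', "", '{"video": "same"}']
patched = list(add_unique_text(sample))
```

To use it on a real file, the function can be fed an open file handle and its output written to a new JSONL file, which is then passed to the recipe as before.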