Hello,
We are trying to review some audio annotations done by labelers. After base64-encoding the media data in our JSONL file, we ended up with a 5 GB encoded JSONL file, supposedly for ~80 videos.
Running this locally with cat ~/audio_b64.jsonl | prodigy audio.manual rev_audio - --loader jsonl --label person
couldn't load our file for reviewing the annotations, and we got this warning in the terminal:
⚠ Warning: filtered 99% of entries because they were duplicates. Only 1
items were shown out of 77. You may want to deduplicate your dataset ahead of
time to get a better understanding of your dataset size.
Any idea what might be causing this? What could be an alternative?
Thank you.
George.
Hi and sorry for only getting to this now, I must have missed the thread!
How does your JSONL data look under the hood? It sounds like the hashes Prodigy auto-generated for it aren't taking the actual video data into account, so all records end up with the same hashes. When generating the JSONL, you can add an _input_hash and _task_hash value (e.g. based on the file name) to help Prodigy distinguish between examples that are identical, different questions about the same data, and entirely different inputs/questions.
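Here's a minimal sketch of what that could look like, assuming each of your records keeps the original file name under a (hypothetical) "path" key – the exact keys will depend on how you build your JSONL:

    import json
    import zlib

    def write_tasks(entries, out_file):
        # entries: dicts with the base64 data under "video" and the original
        # file name under "path" (assumed key – adjust to your own data)
        with open(out_file, "w", encoding="utf8") as f:
            for entry in entries:
                # derive distinct integer hashes from the file name so Prodigy
                # doesn't treat every record as a duplicate of the first one
                entry["_input_hash"] = zlib.crc32(entry["path"].encode("utf8"))
                entry["_task_hash"] = zlib.crc32(("label:" + entry["path"]).encode("utf8"))
                f.write(json.dumps(entry) + "\n")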
Btw, if your data is this large, you might also want to consider a different loading strategy so you don't end up with these huge JSONL files and base64 strings that are sent back and forth. One option could be to use an S3 bucket (or similar) and only have your JSONL contain the URLs.
Hello Ines,
Yes, it turns out that under the hood the entries in my JSONL file actually all have the same _input_hash – only the _task_hash values differ.
I tried dropping the _input_hash, and also changing it by one or two digits so it differs a bit; in both cases it still raised the same warning. Any ideas on what to do next?
I switched to reading from the URL directly, thank you for that – I also thought it might be a problem with the base64 encoding.
Okay, so you mean you still saw the warning about 99% of entries being filtered? I just double-checked, and the audio recipe actually re-hashes the stream by default and, for some reason, it doesn't seem to take the "video" key into account.
In the meantime, can you try assigning a unique "text" value to each entry? This should do the trick and convince Prodigy that the records are all different.
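For instance, something as simple as this should work (again assuming a hypothetical "path" key that holds the file name):

    # give every record a unique "text" value (here: the file name) so the
    # default re-hashing sees each entry as a different input
    for entry in entries:
        entry["text"] = entry["path"]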
Hello Ines,
I assigned a different "text" and _input_hash value to each entry. I am trying it with a small JSONL file now to see if that helps.
Should an entry in my file look something like this? Where should I put the URL link? -- Keep in mind that I am trying to run this locally first, and will then try it on an EC2 instance, where we actually have the data as well.
That looks good, but you still need a key "video" or "audio" in your entry that includes the base64-encoded data or a URL to the file (can't be a local path, though, because your browser will most likely block this for security reasons).
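Just as an illustration (the URL and file names here are made up), a single task could look roughly like this:

    task = {
        "video": "https://my-bucket.s3.amazonaws.com/clips/clip_001.mp4",
        "text": "clip_001.mp4",            # unique per entry, see above
        "meta": {"file": "clip_001.mp4"},  # meta fields are shown in the corner of the UI
    }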
If I understood correctly, connecting that to my EC2 data would be blocked by my browser?
I was trying to fetch the video files from my S3 bucket using boto3 and botocore, but no luck. Can you point me to an example somewhere?
From what I have understood, that would change the loader in our command line as well, right?
So eventually our command line should look something like this: python3 -m prodigy audio_custom.manual testdb --label singer -F audio_recipe_s3.py -
If the file path is an http(s) URL, that's fine – but your browser would likely block paths like /users/you/some_file.jpg (referring to local paths on your filesystem), and you typically don't want to disable this behaviour either. This thread has some more background on this:
If you're using one of the built-in recipes, you can set the --loader argument to define how the file should be loaded. So if you want to load your videos from a JSONL file containing URLs, you'd set --loader jsonl.
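If it helps, here's a rough sketch of how you could generate such a JSONL with pre-signed S3 URLs via boto3 – the bucket name, prefix and expiry are just placeholders, and the pre-signing is only needed if the bucket isn't public:

    import json
    import boto3

    BUCKET = "my-audio-bucket"  # placeholder – your bucket name
    PREFIX = "clips/"           # placeholder – folder with the video files

    s3 = boto3.client("s3")
    with open("audio_urls.jsonl", "w", encoding="utf8") as f:
        # list_objects_v2 returns up to 1000 keys, plenty for ~80 videos
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
            key = obj["Key"]
            # pre-signed URL so the browser can fetch the file from a private bucket
            url = s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": BUCKET, "Key": key},
                ExpiresIn=24 * 3600,
            )
            f.write(json.dumps({"video": url, "text": key}) + "\n")

You could then point the built-in recipe at that file, e.g. prodigy audio.manual testdb audio_urls.jsonl --loader jsonl --label singer, without needing a custom recipe at all.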