I have a custom recipe that does some data transformation (e.g. creating multiple examples from one input). When I print everything just before the yield statement, it generates the examples I want, each with unique input and task hashes.
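
A minimal sketch of that kind of uniqueness check, assuming each task is a dict carrying Prodigy's `_input_hash` and `_task_hash` keys. The helper name `find_repeated_hashes` is hypothetical and this is not the actual recipe code:

```python
def find_repeated_hashes(tasks, key="_task_hash"):
    """Return the hash values that occur more than once, in first-repeat order.

    `tasks` is any iterable of dicts; `key` is the hash field to inspect
    (assumed to be "_task_hash" or "_input_hash").
    """
    seen = set()
    repeated = []
    for task in tasks:
        value = task[key]
        # Record a hash the first time it is seen again.
        if value in seen and value not in repeated:
            repeated.append(value)
        seen.add(value)
    return repeated
```

An empty list from this check on the generated tasks would mean the generator itself is not producing the repeats, which points at how the stream is being consumed rather than at the transformation.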
However, when I start annotating I see duplication. If I number the examples coming out of yield task starting from 1, then with the default batch_size of 10 the pattern is
1, 2, 3, ..., 22, 23, 1, 2, (and I stop tracking).
With batch_size 1, the pattern is
1, 2, 3, 1, 2, 3, 4, 5, 6, 4, 5, 6, ... (i.e. it seems every 3 examples are shown twice and then we move on to the next 3).
If I db-out a dataset generated this way, the resulting JSONL has the duplicate entries: the whole lines are exactly the same, including the hashes (assuming the same annotation on both passes).
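
As a rough sketch, the duplicates in the exported file could be counted like this, assuming one JSON object per line with a `_task_hash` field. The function name `count_duplicate_tasks` is hypothetical:

```python
import json
from collections import Counter


def count_duplicate_tasks(lines, key="_task_hash"):
    """Map each hash that appears more than once to its number of occurrences.

    `lines` is an iterable of JSONL strings (e.g. an open file); blank
    lines are skipped.
    """
    counts = Counter(
        json.loads(line)[key] for line in lines if line.strip()
    )
    return {value: n for value, n in counts.items() if n > 1}
```

It could be called with an open file handle on the export, e.g. `count_duplicate_tasks(open("annotations.jsonl"))`, where the filename is a placeholder.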
exclude_by is left at, or explicitly set to, task, which I thought should prevent the above from occurring.
I'm using Prodigy 1.9.9, default DB.