Duplicate examples shown after restarting server

I noticed that if I rerun ner.make-gold with the same dataset and the same data file, I will (at least sometimes) be asked to label the same example twice. In fact, after exporting the annotated data, I see that two annotated examples have the same input_hash and task_hash.

I’m assuming this isn’t the intended behavior? I can manually remove already-annotated examples from the data file every time I restart ner.make-gold, but it would be nice if the recipe automatically skipped examples that have already been annotated.
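For anyone who wants to reproduce the duplicate check described above, here is a minimal sketch that works on an exported JSONL file. It assumes the export stores the hashes under the keys `_input_hash` and `_task_hash`; the function name and the sample records are made up for illustration.

```python
import json
from collections import Counter


def find_duplicate_hashes(lines):
    """Count (input hash, task hash) pairs and return those seen more than once.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    Assumes each record carries `_input_hash` and `_task_hash` keys.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        pair = (example.get("_input_hash"), example.get("_task_hash"))
        counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up sample standing in for an exported dataset.
sample = [
    '{"text": "a", "_input_hash": 1, "_task_hash": 10}',
    '{"text": "b", "_input_hash": 2, "_task_hash": 20}',
    '{"text": "a", "_input_hash": 1, "_task_hash": 10}',
]
duplicates = find_duplicate_hashes(sample)
print(duplicates)
```

With a real export, you would pass the open file instead of the sample list, for example `with open("annotations.jsonl", encoding="utf8") as f: find_duplicate_hashes(f)`, where the filename is a placeholder.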

Thanks for the report. This is strange and definitely not intended: one set shouldn’t contain more than one example with the same hashes. Did you spot the duplicate examples towards the beginning of the dataset or towards the end? That might indicate that there’s an unintended overlap somewhere.

I suspect that the cause of this problem might be related to this issue: Old examples are automatically added to new dataset. I’m pretty sure we tracked this down to a problem with how the database handles the input and task hashes, and we’ve already fixed that for the upcoming v1.4.1 (we’re just running a few tests and will release the update asap, so you can check if it fixes the problem).

It looked like after restarting the command I got the same examples again from the beginning of the file. But thanks for the quick reply, I’ll see if this got fixed in the new version already.

Sorry for resurrecting such an old post. We are currently seeing the same behavior. After restarting the application, Prodigy starts presenting the data from the beginning again. We are working with a MySQL connection. The version we are using is 1.11.7. Any tips?

@baeumer Are the hashes in the incoming examples the same as the hashes of the duplicate examples already in the dataset?
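One way to answer that question is to compare the two sets of hashes directly. The sketch below is only an illustration: it assumes both the incoming examples and the examples exported from the dataset are already loaded as dicts with a `_task_hash` key (incoming examples would need their hashes assigned first, the same way the recipe assigns them). The function name and sample data are hypothetical.

```python
def split_by_seen(incoming, existing, key="_task_hash"):
    """Split incoming examples into (already_in_dataset, new) by hash.

    `incoming` and `existing` are iterables of dicts. Examples are matched
    on `key`, which is assumed to be present on the hashed examples.
    """
    seen = {eg[key] for eg in existing if key in eg}
    already, fresh = [], []
    for eg in incoming:
        if eg.get(key) in seen:
            already.append(eg)
        else:
            fresh.append(eg)
    return already, fresh


# Made-up data: one incoming example matches the dataset, one does not.
existing = [{"text": "a", "_task_hash": 10}]
incoming = [{"text": "a", "_task_hash": 10}, {"text": "b", "_task_hash": 20}]
already, fresh = split_by_seen(incoming, existing)
print(len(already), len(fresh))
```

If `already` comes back empty for examples that are visibly the same text, the hashes differ between runs, which would explain why the examples are shown again.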