We have a problem with duplicates. We used ner.manual for annotation, and after a couple thousand annotations (about 6,000), we changed the annotation schema, which is why we wanted to go over those 6,000 examples again. We did this by exporting the JSONL with db-out and using it as the input file for the revised annotations (again using ner.manual, but saving into a new, empty dataset).
Now we've exported this new dataset, db_2, and realized that it contains 12,000 annotations, i.e. every annotation exists twice, even with the same input_hash and task_hash. How could that happen, and can we do something directly in Prodigy to fix this issue? And how do we know which examples are the revised ones? Is it the second half of the annotations, i.e. annotations 6,001 to 12,000?
Thanks in advance!
Hi! Your workflow sounds correct and is exactly what I would have recommended. What you're seeing there is very strange: if you created a new, empty dataset and added new annotations to it from an input file, it should only contain those annotations. It's difficult to debug this without the data, but if you can share one example of two tasks in the set with identical task/input hashes, that'd be helpful.
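To see how bad the duplication is, you could scan the exported JSONL yourself. Here's a minimal sketch (not an official Prodigy feature) that counts how often each `_task_hash` occurs; it assumes a file produced by `db-out`, where each line is a JSON object with a `_task_hash` field:

```python
import json
from collections import Counter

def find_duplicate_hashes(path):
    """Count _task_hash occurrences in a db-out JSONL export and
    return only the hashes that appear more than once."""
    counts = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["_task_hash"]] += 1
    return {h: n for h, n in counts.items() if n > 1}
```

If every hash comes back with a count of exactly 2, that would confirm your suspicion that the whole set was duplicated once.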
To answer your more direct question: New examples should be appended to existing examples, so if you ever end up with old and new examples in the same set (e.g. if someone accidentally saves re-annotated examples to the old set), those should come last.
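Building on that ordering, if old and new annotations did end up mixed in one set with the re-annotated examples appended last, you could deduplicate the export by keeping the last occurrence of each `_task_hash`. This is a hedged sketch, not a built-in Prodigy command; the file paths are placeholders, and it assumes the `_task_hash` field from a `db-out` export:

```python
import json

def keep_last_by_task_hash(in_path, out_path):
    """Deduplicate a JSONL export, keeping the LAST example for each
    _task_hash, i.e. the revised annotation if re-annotations were
    appended after the originals."""
    latest = {}  # dicts preserve insertion order in Python 3.7+
    with open(in_path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                eg = json.loads(line)
                latest[eg["_task_hash"]] = eg  # later examples win
    with open(out_path, "w", encoding="utf8") as f:
        for eg in latest.values():
            f.write(json.dumps(eg) + "\n")
```

The deduplicated file could then be imported into a fresh dataset with `db-in`.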
Another thing that might be very useful to check: for each annotation session, Prodigy also creates a session dataset named after the timestamp. So if you know when you saved the annotations, you can also export those separately using the session name, for instance `db-out 2019-05-28_12-14-50`. To list all your session datasets, you can run `prodigy stats -ls`.
Hi Ines, I'll have a look at the annotation session timestamps, and then we'll just take the last annotations to go on. Thank you for your help!