We have a problem with duplicates. We used ner.manual for annotation, and after a couple thousand annotations (about 6,000), we changed the annotation schema, which is why we wanted to go over those 6,000 examples again. We did this by exporting the JSONL with db-out and using it as the input file for the revised annotations (again using ner.manual, but saving into a new, empty dataset).
Now we've exported this new dataset, db_2, and realized that it contains 12,000 annotations, i.e. every annotation exists twice, even with the same input_hash and task_hash. How could that happen, and can we do something directly in Prodigy to fix this issue? And how do we know which examples are the revised ones? Is it the second half of the annotations, i.e. annotations 6,001 to 12,000?
Thanks in advance!
Hi! Your workflow sounds correct and is exactly what I would have recommended. What you're seeing there is very strange: if you created a new, empty dataset and added new annotations to it from an input file, it should only contain those annotations. It's difficult to debug this without the data, but if you can share one example of two tasks in the set with identical task/input hashes, that'd be helpful.
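To see how bad the duplication is, you could scan the exported JSONL yourself. Here's a minimal sketch (not an official Prodigy feature) that counts how often each `_task_hash` occurs; it assumes a file produced by `db-out`, where each line is a JSON object with a `_task_hash` field:

```python
import json
from collections import Counter

def find_duplicate_hashes(path):
    """Count _task_hash occurrences in a db-out JSONL export and
    return only the hashes that appear more than once."""
    counts = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                counts[json.loads(line)["_task_hash"]] += 1
    return {h: n for h, n in counts.items() if n > 1}
```

If every hash comes back with a count of exactly 2, that would confirm your suspicion that the whole set was duplicated once.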
To answer your more direct question: New examples should be appended to existing examples, so if you ever end up with old and new examples in the same set (e.g. if someone accidentally saves re-annotated examples to the old set), those should come last.
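Building on that ordering, if old and new annotations did end up mixed in one set with the re-annotated examples appended last, you could deduplicate the export by keeping the last occurrence of each `_task_hash`. This is a hedged sketch, not a built-in Prodigy command; the file paths are placeholders, and it assumes the `_task_hash` field from a `db-out` export:

```python
import json

def keep_last_by_task_hash(in_path, out_path):
    """Deduplicate a JSONL export, keeping the LAST example for each
    _task_hash, i.e. the revised annotation if re-annotations were
    appended after the originals."""
    latest = {}  # dicts preserve insertion order in Python 3.7+
    with open(in_path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                eg = json.loads(line)
                latest[eg["_task_hash"]] = eg  # later examples win
    with open(out_path, "w", encoding="utf8") as f:
        for eg in latest.values():
            f.write(json.dumps(eg) + "\n")
```

The deduplicated file could then be imported into a fresh dataset with `db-in`.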
Another thing that might be very useful to check: for each annotation session, Prodigy also creates a session dataset named after the timestamp. So if you know when you saved the annotations, you can also export those separately using the session name, for instance `db-out 2019-05-28_12-14-50`. To list all your session datasets, you can run `prodigy stats -ls`.
Hi Ines, I'll have a look at the annotation session timestamps, and then we'll just take the last annotations to go on. Thank you for your help!