We have an input .jsonl file with 1000 pre-annotated documents and multiple users (3) working on the same dataset. The output contained 1025 annotations per user, so users annotated 25 of the documents twice. For the dupes, the input and task hashes in the output are identical. We are all on the latest version of Prodigy (1.11.5), and we did not use force_stream_order.
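For anyone wanting to verify the dupe counts themselves: each task in a db-out export carries a _task_hash, so counting repeated hashes gives the number of duplicate annotations. A quick sketch (assuming a standard db-out JSONL; the function name is ours):

```python
import json
from collections import Counter

def count_duplicates(path):
    """Count annotations in a db-out JSONL that repeat an earlier _task_hash."""
    hashes = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            task = json.loads(line)
            hashes[task["_task_hash"]] += 1
    # Every occurrence beyond the first is a duplicate annotation.
    return sum(n - 1 for n in hashes.values() if n > 1)
```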
User 1:
- used a dataset with 1000 lines and no duplicates
- saw 25 duplicates when running db-out
- saw 1025 completed tasks in the UI
- was on 1.11.5 the entire time
- dupes differed from the other users'
- used the textcat.manual recipe and a different dataset name than the other two users
User 2:
- used a dataset with 1000 lines and no duplicates
- saw only 4 duplicates when running db-out
- saw 1025 completed tasks in the UI
- upgraded from 1.11.2 to 1.11.5 during annotation
- dupes differed from the other users'
- used the mark recipe, same dataset name as User 3
User 3:
- used a dataset with 1000 lines and no duplicates
- saw 1004 completed tasks in the UI
- upgraded from 1.11.3 to 1.11.5 during annotation
- used the mark recipe, same dataset name as User 2
- db-out showed 2008 lines in total: 1004 each from User 2 and User 3
- db-out showed 4 dupes (which did not match the other users' dupes)
We had set feed_overlap to true so that we all saw the same documents in the dataset. We are using the mark recipe.
PRODIGY_ALLOWED_SESSIONS=jane,john prodigy mark october_dataset <path_to_file>.jsonl --view-id classification
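For context, we set feed_overlap in our prodigy.json rather than per recipe. A minimal sketch of the relevant part of the file, assuming everything else is left at the defaults:

```json
{
  "feed_overlap": true
}
```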
Hi! It seems you've definitely found an issue here. I'll be taking a look today. It would be helpful to know around what point in your input data you start to see duplicates, if that's something you can share (towards the beginning, somewhere in the middle, or only at the end).
Thank you for looking into this! We all noticed some dupes before the tasks were complete (documents seemed familiar, and db-out confirmed the dupes), but Users 2 and 3 upgraded Prodigy to the latest version at that point, thinking it might solve the issue; it didn't. Most of the dupes were discovered at the end, once each user had completed their tasks. We all had the same input (1000 documents), but the UI showed additional tasks, and db-out confirmed we each had dupes in our final set.
We had another instance of duplicates in the output in a recent annotation exercise, this time with a custom recipe and a single user. The original file had 828 documents, and the user received 840 tasks. Any update on this, @kab?
I am seeing a very similar bug. At first I was using named sessions; thinking they could be the problem, I stopped using them, but the problem still occurs. Prodigy asks me to annotate documents that have already been annotated.
I am on version 1.11.6 and did not upgrade during the process.
It looks like prodigy progress is able to spot that there are duplicates:
           New   Unique   Total   Unique
--------   ---   ------   -----   ------
Dec 2021   122      109     122      109
Hi, sorry for the lack of response. I have been looking into this a bit, but it's been tough to debug. I'm dedicating the rest of my week to it and will update this thread when I make some headway.
Quick question for you both, because I was able to identify a specific case where duplicates were being shown: how quickly are you answering questions? If I answer very quickly (basically just immediately hitting accept), I can run into a state where duplicates are queued. I've fixed the underlying issue, and the fix will be released in version 1.11.7; however, I'm not convinced this is the only issue. Thanks for bearing with us on this!
We had some really simple tasks (true or false) where this happened, so often 5-10 seconds per doc. We also had more complex tasks where it took a bit longer and this still occurred, but even then it was most likely less than a minute per doc.
Great, thank you both for the replies. I think we'll release a fix for this particular issue early next week; it sounds like you're both running into it.
We just published an alpha version (1.11.7a0). We're still trying to track down another potential issue with duplicates but please try this out and report any issues back on this thread when you get the chance.
You can install from our PyPI server using your Prodigy License Key:
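The exact command wasn't quoted here; the usual form for installing from Prodigy's PyPI server is roughly the following, with XXXX-XXXX-XXXX-XXXX standing in for your actual license key:

```shell
# Pinning the exact pre-release version lets pip pick it up without --pre.
pip install prodigy==1.11.7a0 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy
```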
What was the exact problem here? The PyPI server is really the easiest and most convenient way for us to distribute alpha releases. If there's no other way, we could mail you a wheel, but it'd be better to get the PyPI download working.
ERROR: Could not find a version that satisfies the requirement prodigy==1.11.7a0 (from versions: none)
ERROR: No matching distribution found for prodigy==1.11.7a0
Can you install other versions via the PyPI server? I definitely want to investigate whether there's a problem with your license key, so maybe you could send us an email with your order ID and/or license key so I can have a look?