Duplicate annotations in output

Hi,

We have an input .jsonl file with 1000 pre-annotated documents and three users working on the same dataset. The output contained 1025 annotations per user, so each user annotated 25 of the documents twice. For the dupes, the input and task hashes in the output are identical. We are all on the latest version of Prodigy (1.11.5), and we did not use force_stream_order.
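
For reference, this is roughly how we confirmed the dupes in an export (a minimal sketch; annotations.jsonl is a placeholder for the file produced by prodigy db-out):

    import json
    from collections import Counter

    # annotations.jsonl stands in for the db-out export; Prodigy stores
    # the hashes it assigns on each saved task as _input_hash / _task_hash.
    with open("annotations.jsonl", encoding="utf8") as f:
        tasks = [json.loads(line) for line in f]

    counts = Counter(task["_task_hash"] for task in tasks)
    dupes = [h for h, n in counts.items() if n > 1]
    print(f"{len(tasks)} tasks, {len(dupes)} duplicated task hashes")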

User 1:

  • used a dataset with 1000 lines with no duplicates
  • saw 25 duplicates when running db-out
  • saw 1025 completed tasks in the UI
  • was on 1.11.5 the entire time
  • dupes differed from the other users' dupes
  • used the textcat.manual recipe and a different dataset name than the other two users

User 2:

  • used a dataset with 1000 lines with no duplicates
  • saw only 4 duplicates when running db-out
  • saw 1025 completed tasks in the UI
  • upgraded from 1.11.2 to 1.11.5 during annotation
  • dupes differed from the other users' dupes
  • used the mark recipe and the same dataset name as User 3

User 3:

  • used a dataset with 1000 lines with no duplicates
  • saw 1004 completed tasks in the UI
  • upgraded from 1.11.3 to 1.11.5 during annotation
  • used the mark recipe and the same dataset name as User 2
  • db-out showed 2008 lines: 1004 each from User 2 and User 3 (shared dataset)
  • db-out showed 4 dupes, which did not match the other users' dupes (see the comparison sketch below)
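
To compare dupes across users, we diffed the duplicated task hashes per export, roughly like the sketch below (the file names are placeholders; for the shared dataset we first split the db-out file per user on each task's _session_id field):

    import json
    from collections import Counter

    def duplicated_hashes(path):
        # Task hashes that appear more than once in one user's export.
        with open(path, encoding="utf8") as f:
            counts = Counter(json.loads(line)["_task_hash"] for line in f)
        return {h for h, n in counts.items() if n > 1}

    # Placeholders for the per-user export files.
    dupes_user2 = duplicated_hashes("user2.jsonl")
    dupes_user3 = duplicated_hashes("user3.jsonl")
    print("dupes in common:", len(dupes_user2 & dupes_user3))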

We set feed_overlap to true so that we would all see the same documents in the dataset, and we are using the mark recipe:

PRODIGY_ALLOWED_SESSIONS=jane,john prodigy mark october_dataset <path_to_file>.jsonl --view-id classification
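
Each annotator then opened the app with their own session name appended to the URL (e.g. ?session=jane), which is how Prodigy routes named sessions when PRODIGY_ALLOWED_SESSIONS is set. For completeness, the relevant line from our prodigy.json (everything else omitted):

    {
      "feed_overlap": true
    }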

Thanks,
Cheyanne

Hi! It seems you've definitely found an issue here, and I'll be taking a look today. If it's something you can share, it would be helpful to know around what point in your input data you start to see duplicates (towards the beginning, somewhere in the middle, or only at the end).

Thank you for looking into this! We all noticed some dupes before our tasks were complete (documents seemed familiar, and db-out confirmed the dupes), but Users 2 and 3 upgraded Prodigy to the latest version at that point, hoping it might solve the issue; it didn't. Most of the dupes were discovered at the end, once each user had completed their tasks. We all had the same input (1000 documents), but the UI showed additional tasks, and db-out confirmed that each of us had dupes in our final set.
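
In case it helps narrow things down, this is roughly how we're checking where the repeats start (a minimal sketch, assuming db-out preserves the order in which answers were saved; annotations.jsonl is again a placeholder for the export):

    import json

    # Record the stream position of each second (or later) occurrence
    # of a task hash, to see where in the stream the repeats begin.
    seen, repeats = set(), []
    with open("annotations.jsonl", encoding="utf8") as f:
        for i, line in enumerate(f):
            h = json.loads(line)["_task_hash"]
            if h in seen:
                repeats.append(i)
            seen.add(h)
    print("repeat occurrences at positions:", repeats[:25])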