Duplicate annotations in output

Hi,

We have an input .jsonl file with 1000 pre-annotated documents, and we have multiple users (3) working on the same dataset. Each user's output contained 1025 annotations, so each user annotated 25 of the documents twice. The input and task hashes of the dupes are identical in the output. We are all on the latest version of Prodigy (1.11.5), and we did not use force_stream_order.

User 1:

  • used a dataset with 1000 lines with no duplicates
  • saw 25 duplicates when running db-out
  • saw 1025 completed tasks in the UI
  • was on 1.11.5 the entire time
  • dupes differed from the other users'
  • used the textcat.manual recipe and a different dataset name than the other two users

User 2:

  • used a dataset with 1000 lines with no duplicates
  • saw only 4 duplicates when running db-out
  • saw 1025 completed tasks in the UI
  • upgraded from 1.11.2 to 1.11.5 during annotation
  • dupes differed from the other users'
  • used the mark recipe, same dataset name as User 3

User 3:

  • used a dataset with 1000 lines with no duplicates
  • saw 1004 completed tasks in the UI
  • upgraded from 1.11.3 to 1.11.5 during annotation
  • used the mark recipe, same dataset name as User 2
  • db-out shows 2008 lines, 1004 each from User 2 and User 3
  • db-out showed 4 dupes (which did not match the other users' dupes)

We had set feed_overlap to true so that we would all see the same documents in the dataset. We are using the mark recipe:

PRODIGY_ALLOWED_SESSIONS=jane,john prodigy mark october_dataset <path_to_file>.jsonl --view-id classification
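For reference, here's how we confirmed the dupes in each export: count how many times each _task_hash appears in the db-out JSONL. A minimal sketch in Python (the export filename is just an example):

import json
from collections import Counter

counts = Counter()
with open("october_dataset.jsonl", encoding="utf8") as f:  # example export path
    for line in f:
        counts[json.loads(line)["_task_hash"]] += 1

dupes = {h: n for h, n in counts.items() if n > 1}
print(f"{len(dupes)} task hashes appear more than once")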

Thanks,
Cheyanne

Hi! It seems you've definitely found an issue here. I'll be taking a look today. It would be helpful to know around what point in your input data you start to see duplicates, if that's something you can share (toward the beginning, somewhere in the middle, or only at the end).

Thank you for looking into this! We all noticed some dupes before all tasks were complete (documents seemed familiar, and db-out confirmed the dupes), but Users 2 and 3 upgraded Prodigy to the latest version at that point and thought it might solve the issue; it didn't. The bulk of the dupes were discovered at the end, once each user had completed their tasks. We all had the same input (1000 documents), but the UI showed additional tasks, and db-out confirmed we each had dupes in our final set.

We had another instance of duplicates in the output in a recent annotation exercise, this time with a custom recipe and a single user. The original file had 828 documents, and the user received 840 tasks. Any update on this, @kab?

Hello,

I am seeing a very similar bug. At first I was using named sessions; thinking they could be the problem, I stopped using them, but the problem still occurs. Prodigy asks me to annotate docs that have already been annotated.

I am on version 1.11.6 and I did not upgrade during the process.

It looks like prodigy progress is able to spot that there are duplicates (122 annotations in total, but only 109 unique):

           New   Unique   Total   Unique
--------   ---   ------   -----   ------
Dec 2021   122      109     122      109

Start command:

prodigy ner.correct testdataset ./model_7500/model-best ./shuffled_data.json --label NAME,PHONE

And my prodigy.json file:

{
    "host": "0.0.0.0",
    "port": 8081,
    "show_stats": true,
    "show_flag": true,
    "ui_lang": "fr",
    "feed_overlap": false,
    "custom_theme": {
        "labels": {
            "NAME": "#fabed4",
            "PHONE": "#aaffc3"
        }
    },
    "keymap_by_label": {"NAME": "q", "PHONE": "e"},
    "keymap": {"accept":["d"]}
}

Thanks

Hi, sorry for the lack of response. I have been looking into this a bit, but it's been tough to debug. I'm dedicating the rest of my week to it, and I'll update this thread when I make some headway.

Quick question for you both, because I was able to identify a specific case where duplicates were being shown: how quickly are you answering questions? If I answer very quickly (basically just immediately hitting accept), I can run into a state where duplicates are queued. I've fixed the underlying issue for this case and the fix will be released in version 1.11.7, but I'm not convinced it's the only issue. Thanks for bearing with us on this!
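In the meantime, if you're running a custom recipe, one possible workaround is to de-duplicate the stream by task hash before Prodigy queues it. A rough sketch (dedup_stream is just a placeholder name, not a Prodigy API):

from prodigy import set_hashes

def dedup_stream(stream):
    # Skip any task whose _task_hash has already been queued in this process.
    seen = set()
    for eg in stream:
        eg = set_hashes(eg)  # make sure _input_hash / _task_hash are set
        if eg["_task_hash"] not in seen:
            seen.add(eg["_task_hash"])
            yield eg

You'd then wrap the stream your recipe returns, e.g. "stream": dedup_stream(stream) in the components dictionary.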

We had some really simple tasks (true or false) where this happened, so often 5-10 seconds per doc. It also occurred on more complex tasks that took a bit longer, but still most likely under a minute per doc.

We are using ner.correct, and when the model is 100% right, we see the answer in less than a second and accept really quickly!

Great, thank you both for the replies. I think we'll release a fix for this particular issue early next week; it sounds like you both might be running into it.

Thanks @kab! How can we access the fix when it becomes available? Via an emailed link?

We just published an alpha version (1.11.7a0). We're still trying to track down another potential issue with duplicates, but please try this out and report any problems back on this thread when you get the chance.

You can install from our PyPI server using your Prodigy License Key:

pip install prodigy==1.11.7a0 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy
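Once it's installed, running prodigy stats should report the version as 1.11.7a0, so you can confirm the upgrade took.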

Linking to my reply on another thread, which might be related to this one.

I was unable to install 1.11.7a0 using the license key. Is there a wheel available?

What was the exact problem here? The PyPI server is really the easiest and most convenient way for us to distribute alpha releases. If there's no other way, we could mail you a wheel, but it'd be better to get the PyPI download working.

ERROR: Could not find a version that satisfies the requirement prodigy==1.11.7a0 (from versions: none)
ERROR: No matching distribution found for prodigy==1.11.7a0

I just tested it and it works fine for me :thinking: Are you sure your license key is active and correct?

Yes, we received the license keys on September 13, 2021.

Can you install other versions via the PyPI server? I definitely want to investigate whether there's a problem with your license key, so could you send us an email with your order ID and/or license key so I can have a look?

I tried this again today with a different key (we have a company list of 5) and didn't get the error, so I have successfully upgraded.