We have an input .jsonl file with 1000 pre-annotated documents and multiple users (3) working on the same dataset. The output contained 1025 annotations per user, so users annotated 25 of the documents twice. For the dupes, the input and task hashes in the output are identical. We are all on the latest version of Prodigy (1.11.5), and we did not use force_stream_order.
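For anyone wanting to verify the dupe counts themselves: each task in a db-out export carries a _task_hash, so counting repeated hashes gives the number of duplicate annotations. A quick sketch (assuming a standard db-out JSONL; the function name is ours):

```python
import json
from collections import Counter

def count_duplicates(path):
    """Count annotations in a db-out JSONL that repeat an earlier _task_hash."""
    hashes = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            task = json.loads(line)
            hashes[task["_task_hash"]] += 1
    # Every occurrence beyond the first is a duplicate annotation.
    return sum(n - 1 for n in hashes.values() if n > 1)
```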
User 1:
- used a dataset with 1000 lines and no duplicates
- saw 25 duplicates when running db-out
- saw 1025 completed tasks in the UI
- was on 1.11.5 the entire time
- dupes differed from the other users'
- used the textcat.manual recipe and a different dataset name than the other two users
User 2:
- used a dataset with 1000 lines and no duplicates
- saw only 4 duplicates when running db-out
- saw 1025 completed tasks in the UI
- upgraded from 1.11.2 to 1.11.5 during annotation
- dupes differed from the other users'
- used the mark recipe, same dataset name as User 3
User 3:
- used a dataset with 1000 lines and no duplicates
- saw 1004 completed tasks in the UI
- upgraded from 1.11.3 to 1.11.5 during annotation
- used the mark recipe, same dataset name as User 2
- db-out showed 2008 lines in total: 1004 each from User 2 and User 3
- db-out showed 4 dupes (which did not match the other users' dupes)
We had set feed_overlap to true so that we all saw the same documents in the dataset. We are using the mark recipe.
PRODIGY_ALLOWED_SESSIONS=jane,john prodigy mark october_dataset <path_to_file>.jsonl --view-id classification
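For context, we set feed_overlap in our prodigy.json rather than per recipe. A minimal sketch of the relevant part of the file, assuming everything else is left at the defaults:

```json
{
  "feed_overlap": true
}
```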
Hi! It seems you've definitely found an issue here. I'll be taking a look today. It would be helpful to know around what point in your input data you start to see duplicates, if that's something you can share (towards the beginning, somewhere in the middle, or only at the end).
Thank you for looking into this! We all noticed some dupes before the tasks were complete (documents seemed familiar, and db-out confirmed the dupes), but Users 2 and 3 upgraded Prodigy to the latest version at that point, thinking it might solve the issue; it didn't. Most of the dupes were discovered at the end, once each user had completed their tasks. We all had the same input (1000 documents), but the UI showed additional tasks, and db-out confirmed we each had dupes in our final set.
We had another instance of duplicates in the output in a recent annotation exercise, this time with a custom recipe and a single user. The original file had 828 documents, and the user received 840 tasks. Any update on this, @kab?
I am seeing a very similar bug. At first I was using named sessions; thinking they could be the problem, I stopped using them, but the problem still occurs. Prodigy asks me to annotate documents that have already been annotated.
I am on version 1.11.6 and did not upgrade during the process.
It looks like prodigy progress is able to spot that there are duplicates:
           New   Unique   Total   Unique
--------   ---   ------   -----   ------
Dec 2021   122      109     122      109
Hi, sorry for the lack of response. I have been looking into this a bit, but it's been tough to debug. I'm dedicating the rest of my week to it and will update this thread when I make some headway.
Quick question for you both, because I was able to identify a specific case where duplicates were being shown: how quickly are you answering questions? If I answer very quickly (basically just immediately hitting accept), I can run into a state where duplicates are queued. I've fixed the underlying issue, and the fix will be released in version 1.11.7; however, I'm not convinced this is the only issue. Thanks for bearing with us on this!
We had some really simple tasks (true or false) where this happened, so often 5-10 seconds per doc. We also had more complex tasks where it took a bit longer and this still occurred, but even then it was most likely less than a minute per doc.
Great, thank you both for the replies. I think we'll release a fix for this particular issue early next week; it sounds like you're both running into it.
We just published an alpha version (1.11.7a0). We're still trying to track down another potential issue with duplicates but please try this out and report any issues back on this thread when you get the chance.
You can install from our PyPI server using your Prodigy License Key:
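The exact command wasn't quoted here; the usual form for installing from Prodigy's PyPI server is roughly the following, with XXXX-XXXX-XXXX-XXXX standing in for your actual license key:

```shell
# Pinning the exact pre-release version lets pip pick it up without --pre.
pip install prodigy==1.11.7a0 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy
```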
What was the exact problem here? The PyPI server is really the easiest and most convenient way for us to distribute alpha releases. If there's no other way, we could mail you a wheel, but it'd be better to get the PyPI download working.
ERROR: Could not find a version that satisfies the requirement prodigy==1.11.7a0 (from versions: none)
ERROR: No matching distribution found for prodigy==1.11.7a0
Can you install other versions via the PyPI server? I definitely want to investigate whether there's a problem with your license key, so maybe you could send us an email with your order ID and/or license key so I can have a look?