Repeated sentences and incorrect annotator id

Hi,

My team is working on annotating both NER and RE with the rel.manual recipe. For this, I am using the following config (prodigy.json):

{
    "feed_overlap": true,
    "custom_theme": {
        "cardMaxWidth": 1500,
        "smallText": 16,
        "relationHeightWrap": 40
    }
}

I'm specifying the sessions with PRODIGY_ALLOWED_SESSIONS=jane,joe,sarah,ale. That is for 3 annotators and me (ale). My session is just for testing.

We started with one database, let's call it test_v1, and an input jsonl file, let's say data.jsonl, which contains all texts we want to annotate. Let's say that ~300 out of 1000 got annotated for the db test_v1.

After a while we modified our annotation rules, so I decided to create a second version of the database (test_v2) in a new Prodigy instance. For the input texts this time, I pulled a subset of texts from data.jsonl to create a data_v2.jsonl. This subset may have some overlapping sentences with the 300 that were previously annotated (I selected texts from line 300 and onwards of data.jsonl).

When the annotators started seeing repeated sentences, I thought they were the ones overlapping between data_v2.jsonl and the 300 they had annotated in test_v1. However, after close examination I see this is not the case. I exported test_v2 using pgy db-out. The repeated sentences reported by one annotator had 4 annotations, which is worrying given there are only 3 annotators and my session is not being used. When looking at the annotator_id and session_id, something weird shows up:

3 annotations look like this (correct annotator and session ids):

"_annotator_id":"test_v2-jane","_session_id":"test_v2-jane"
"_annotator_id":"test_v2-joe","_session_id":"test_v2-joe"
"_annotator_id":"test_v2-sarah","_session_id":"test_v2-sarah"

But a fourth annotation has test_v1 (incorrect database) in the annotator and session ids:

"_annotator_id":"test_v1-jane","_session_id":"test_v1-jane"

Why could this be happening?

Update: Only one of the three annotators is experiencing the issue. One thing to note is that this annotator (jane) was working on test_v1 in a browser session when I asked her to save her progress, as I was going to restart the server to set up a new instance for test_v2. Once the new instance was running, she may have kept annotating in the same browser window she was using for test_v1.

Thanks

Hi @ale,

This happens because the Prodigy front end does not automatically shut down together with the server (which is why jane could have kept her tab active), and it will try to "talk to" any Prodigy server available. It will only error out when there's no server available. In this case, the test_v2 server was available, which is why jane's answers got stored there.

At the moment there is no mechanism in place that would ensure the front end is matched with the correct back-end server. The only way to prevent this would be to explicitly instruct the annotators to close their browser tab if the answers are supposed to be stored in a new dataset.
Thinking about it, there perhaps should be some sort of front-end message saying that the server has been shut down and that the user should refresh to make sure the session id corresponds to the target dataset. We'll definitely discuss it internally.
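
In the meantime, it may help to locate the affected records in your export. Below is a minimal sketch, assuming the export is JSONL with one record per line and that each record carries the `_session_id` field shown in your post. The function name `find_stray_annotations` and the inline sample records are hypothetical, made up for illustration only; this is plain Python and not part of Prodigy.

```python
import json


def find_stray_annotations(lines, dataset):
    """Return records whose _session_id does not start with '<dataset>-'.

    `lines` is any iterable of JSONL strings (e.g. an open file).
    """
    prefix = f"{dataset}-"
    stray = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        session_id = record.get("_session_id") or ""
        if not session_id.startswith(prefix):
            stray.append(record)
    return stray


# Hypothetical sample mimicking the ids from the post above.
sample = [
    '{"text": "a", "_annotator_id": "test_v2-jane", "_session_id": "test_v2-jane"}',
    '{"text": "a", "_annotator_id": "test_v1-jane", "_session_id": "test_v1-jane"}',
]

for record in find_stray_annotations(sample, "test_v2"):
    print(record["_annotator_id"], record["_session_id"])
```

To run it on real data, you could pass an open file handle for the exported JSONL instead of `sample`. Records flagged this way were submitted from a tab that was still carrying the old session name, so you can review them or filter them out before training.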
