Duplicate annotations in output

Quick question for you both, because I was able to identify a specific case where duplicates were being shown: how quickly are you answering questions? If I answer questions very quickly (basically just immediately hitting accept), I can get into a state where duplicates are queued. I've fixed the underlying issue, and the fix will be released in version 1.11.7, but I'm not convinced this is the only problem. Thanks for bearing with us on this!

We had some really simple tasks (true or false) where this happened, so often 5-10 seconds per question. It also occurred on more complex tasks that took a bit longer, but still most likely less than a minute per doc.

We are using ner.correct, and when the model is 100% right we see the answer in less than a second and accept really quickly!

Great, thank you both for the replies. I think we'll release a fix for this particular issue early next week; it sounds like you both might be running into it.

Thanks @kab! How can we access the fix when it becomes available? Via an emailed link?

We just published an alpha version (1.11.7a0). We're still trying to track down another potential issue with duplicates, but please try this out and report any issues back on this thread when you get the chance.

You can install from our PyPI server using your Prodigy License Key:

pip install prodigy==1.11.7a0 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy
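If the install goes through, a quick sanity check (nothing specific to the alpha, just the regular stats command) is to confirm which version is actually active in your environment:

python -m prodigy stats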

Linking to my reply to another thread which might be related to this one.

I was unable to install 1.11.7a0 using the license key. Is there a wheel available?

What was the exact problem here? The PyPI server is really the easiest and most convenient way for us to distribute alpha releases. If there's no other way, we could mail you a wheel, but it'd be better to get the PyPI download working.
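One generic thing worth trying (just standard pip debugging, not a known fix on our side) is to re-run the install with verbose output, which shows which index URLs pip actually queries and whether the download server rejects the request:

pip install prodigy==1.11.7a0 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy -v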

ERROR: Could not find a version that satisfies the requirement prodigy==1.11.7a0 (from versions: none)
ERROR: No matching distribution found for prodigy==1.11.7a0

I just tested it and it works fine for me :thinking: Are you sure your license key is active and correct?

Yes, we received the license keys on September 13, 2021.

Can you install other versions via the PyPI server? I definitely want to investigate whether there's a problem with your license key, so maybe you could send us an email with your order ID and/or license key so I can have a look?

I tried this again today with a different key (we have a company list of 5) and didn't get the error, so I have successfully upgraded.

Sorry for the delay! I did a test with the latest update with two annotators, same dataset, each of us annotating the same data to measure inter-annotator agreement. I did not have any dupes, but my colleague had 10. They seemed to coincide with server interruptions. My settings in prodigy.json:

"feed_overlap": true,
"force_stream_order": true,

Thanks for the update, and that's interesting! What exactly do you mean by server interruptions? Is this the server being stopped manually, or being temporarily unreachable?

Hi,
My team is having the same issue reported here with duplicates. We're on Prodigy's latest version (1.11.7a0) with feed_overlap set to false, but we're still getting duplicates within a session, and at times the same data is shown to different users. We've also spotted a "loop issue" like the one reported in the other thread linked here. Waiting for any updates, thank you!

What's your annotation process like with the different annotators, and is it possible you're hitting a scenario where "work stealing" kicks in?
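If it helps narrow this down, here's a rough sketch for checking that (it assumes you're using named sessions, so each saved example carries a _session_id, and the dataset name is just a placeholder). It groups the duplicated inputs by session, which hints at whether the extra copies were answered by a different annotator:

from prodigy.components.db import connect
import pandas as pd

db = connect()
df = pd.DataFrame(db.get_dataset('your_dataset'))  # placeholder name, replace with your dataset

# count how often each input was answered and by which named sessions
counts = df.groupby('_input_hash').agg(
    n_answers=('_task_hash', 'size'),
    sessions=('_session_id', lambda s: sorted(set(s))),
)
print(counts[counts['n_answers'] > 1])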

Dear Prodigy Team,

We are using Prodigy to classify each of a set of Chinese-language documents into one of n buckets. Our problem: we want our human labeller to label each document only once, but sometimes the UI presents the same document several times.

We can see this multiple-suggestion issue by inspecting the table Prodigy writes to after we have labelled a sequence of documents. We did some analysis on this table using pandas (see below for code and output). Based on this, we believe the problem may be related to the batch_size parameter: each document seems to be presented twice, often exactly but always roughly batch_size clicks apart. For example, if batch_size is 5, the 6th document seen by the labeller will be the same as the 11th document. We verified this for batch_size values of 5 and 10. Note that Prodigy itself is aware of the duplication, in the sense that it assigns the same _input_hash to identical documents.

We may simply be using the wrong settings, either in the config file or in the recipe itself (both are reproduced below). In addition to batch_size, so far we have tried adjusting the following settings:

  1. feed_overlap
  2. instant_submit

However, no combination of settings solved the problem. We also noticed that submitting answers very quickly to the UI seems to generate more repeats in a row. Below are the recipe and config files, along with the Python code we used to investigate the repeated suggestions, which revealed the “every 5th document” issue.

The code we used to discover duplicates is below:

from prodigy.components.db import connect
import pandas as pd

db = connect()

df = pd.DataFrame(db.get_dataset('debug'))

# for a given _input_hash, count the rows in the dataframe that share it
dupes_dict = {input_hash: len(df[df['_input_hash'] == input_hash])
              for input_hash in df['_input_hash'].unique()}

# only keep inputs that occur more than once
dupes_only_dict = {k: v for k, v in dupes_dict.items() if v > 1}

df['idx'] = df.index
df[df['_input_hash'].isin(dupes_only_dict.keys())].sort_values(
    by=['_input_hash', 'idx'])[['text', '_input_hash', '_task_hash', 'accept', 'answer', '_timestamp']]

Result (with batch_size 5 and instant_submit set to false):
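As a small follow-up (just a sketch reusing df and dupes_only_dict from the snippet above), the spacing between the two presentations of each duplicated input can be quantified directly, which is one way to see the "roughly batch_size apart" pattern described earlier:

# rows between consecutive presentations of the same input
gaps = (df[df['_input_hash'].isin(dupes_only_dict.keys())]
        .sort_values('idx')
        .groupby('_input_hash')['idx']
        .apply(lambda s: s.diff().dropna().tolist()))
print(gaps)  # per the pattern above, with batch_size 5 the gaps should be roughly 5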

This is the recipe used:

import prodigy
from prodigy.components.filters import filter_duplicates
from prodigy.components.loaders import JSONL

@prodigy.recipe('debug')
def debug_labeller():

    dataset = 'debug'
    source = 'data/debug_dataset.jsonl'
    choices = [
        dict(id='mneg', text='Negative'),
        dict(id='mneu', text='Neutral'),
        dict(id='mpos', text='Positive')
    ]

    def add_options(stream):
        for eg in stream:
            eg['options'] = choices
            yield eg

    stream = JSONL(source)
    # assign _input_hash/_task_hash based on the title and text fields
    stream = (prodigy.set_hashes(eg, input_keys=('title', 'text'))
              for eg in stream)
    # drop examples whose _input_hash has already been seen in the stream
    stream = filter_duplicates(stream, by_input=True, by_task=False)
    stream = add_options(stream)
    # materialize the whole stream up front
    stream = list(stream)

    config = {
        'blocks': [{
            'view_id': 'html',
            'html_template': '<h3>{{title}}</h3>'
        }, {
            'view_id': 'choice',
        }],
        'instructions': './docs/instructions/instructions.html',
        'choice_style': 'multiple',
        'choice_auto_accept': False,
        'feed_overlap': False,
        'port': 8023
    }
    return {
        'dataset': dataset,
        'exclude': [dataset],
        'stream': stream,
        'view_id': 'blocks',
        'config': config
    }
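For completeness, we start the recipe with the following command (assuming the code above is saved as recipe.py; the file name is just illustrative):

prodigy debug -F recipe.py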

This is our prodigy.json config:

{
  "theme": "basic",
  "custom_theme": {
    "cardMaxWidth": 1920
  },
  "batch_size": 5,
  "history_size": 10,
  "host": "0.0.0.0",
  "cors": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
			"details": "omitted"
    }
  },
  "keymap": {
    "accept": ["space"],
    "save": ["command+space"],
    "ignore": ["i"]
  },
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": true,
  "feed_overlap": false,
  "auto_count_stream": true,
  "ui_lang": "en",
  "project_info": [
    "dataset",
    "session",
    "lang",
    "recipe_name",
    "view_id",
    "label"
  ],
  "show_stats": true,
  "hide_meta": false,
  "show_flag": false,
  "swipe": true,
  "swipe_gestures": { "right": "accept", "left": "reject" },
  "split_sents_threshold": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "input"
}

Thanks,
David

@leetdavid Thanks for the report, I've merged your post into a previous thread since it seems to be related to the same problem/questions.

One thing to look out for is whether the work stealing timeout might be the culprit, and whether your annotation workflow hits scenarios where previously unanswered questions are added back to the queue too early: