Duplicate annotations in output

cheyanneb · January 19, 2022, 3:08pm

Sorry for the delay! I did a test with the latest update with two annotators, same dataset, each of us annotating the same data to understand inter-annotator agreement. I did not have any dupes, but my colleague did have 10. It seemed to coincide with server interruptions. My settings in prodigy.json:

"feed_overlap": true,
"force_stream_order": true,

ines · January 19, 2022, 6:07pm

Thanks for the update and that's interesting! What exactly do you mean by server interruptions: is the server being stopped manually, or is it temporarily unreachable?

Laura · January 27, 2022, 9:07am

Hi,
My team is having the same duplicates issue reported here. We are on Prodigy's latest version (1.11.7a0) with feed_overlap set to false, but we still get duplicates within a session, and at times the same data is shown to different users. We also spotted a "loop issue" like the one reported in a different thread linked here. Waiting for any updates, thank you!

ines · January 27, 2022, 9:15pm

What's your annotation process like with the different annotators and is it possible you're hitting a scenario where "work stealing" kicks in?

leetdavid · January 28, 2022, 5:54am

Dear Prodigy Team,

We are using Prodigy to classify each of a set of Chinese-language documents into one of n buckets. Our problem: we want our human labeller to label each document only once, but sometimes the UI presents the same document several times. We can see this by inspecting the table Prodigy writes to after we have labelled a sequence of documents.

We did some analysis on this table using pandas (see below for code and output). Based on this, we believe the problem may be related to the batch_size parameter: each document seems to be presented twice, roughly (and often exactly) batch_size clicks apart. For example, if batch_size is 5, the 6th document seen by the labeller will be the same as the 11th. We verified this for batch_size values of 5 and 10. Note that Prodigy itself is aware of the duplication, in the sense that it assigns the same _input_hash to identical documents.

We may simply be using the wrong settings, either in the config file or in the recipe itself (both are reproduced below). In addition to batch_size, so far we have tried adjusting the following settings:

feed_overlap
instant_submit

However, no combination of settings solved the problem. Also, we noticed that submitting answers very quickly to the UI seems to generate more repeats in a row. Below are the recipe and config files, along with the python code we used to investigate the repeated suggestions, which revealed the “every 5th document” issue.

The code we used to discover duplicates is as below:

from prodigy.components.db import connect
import pandas as pd

db = connect()

df = pd.DataFrame(db.get_dataset('debug'))

# For each _input_hash, count how many rows in the dataframe share it.
# (The loop variable is named h to avoid shadowing the built-in hash().)
dupes_dict = {h: len(df[df['_input_hash'] == h])
              for h in df['_input_hash'].unique()}

# Only keep hashes that occur more than once.
dupes_only_dict = {k: v for k, v in dupes_dict.items() if v > 1}

df['idx'] = df.index
df[df['_input_hash'].isin(dupes_only_dict.keys())].sort_values(
    by=['_input_hash', 'idx'])[['text', '_input_hash', '_task_hash', 'accept', 'answer', '_timestamp']]

Result: (with batch_size 5 and instant_submit set to false)
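
To put a number on the "batch_size clicks apart" observation, the distance between repeated appearances of the same _input_hash can be measured directly. The following is a minimal sketch and not part of the original analysis: duplicate_gaps is a hypothetical helper, and the toy list stands in for df['_input_hash'] in annotation order.

```python
from collections import defaultdict


def duplicate_gaps(input_hashes):
    """Map each repeated hash to the gaps between its consecutive appearances."""
    positions = defaultdict(list)
    for idx, h in enumerate(input_hashes):
        positions[h].append(idx)
    return {
        h: [later - earlier for earlier, later in zip(idxs, idxs[1:])]
        for h, idxs in positions.items()
        if len(idxs) > 1  # only hashes that were shown more than once
    }


# Toy data (made up): the 6th to 10th documents come back as the 11th to 15th,
# which is the pattern described for batch_size 5.
toy = [101, 102, 103, 104, 105, 106, 107, 108, 109, 110,
       106, 107, 108, 109, 110, 111]
gaps = duplicate_gaps(toy)
print(gaps)
```

On real data the call would be duplicate_gaps(df['_input_hash']). If the gaps cluster around the configured batch_size, that supports the hypothesis that whole batches are being re-queued.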

This is the recipe used:

import prodigy
from prodigy.components.filters import filter_duplicates
from prodigy.components.loaders import JSONL

@prodigy.recipe('debug')
def debug_labeller():

    dataset = 'debug'
    source = 'data/debug_dataset.jsonl'
    choices = [
        dict(id='mneg', text='Negative'),
        dict(id='mneu', text='Neutral'),
        dict(id='mpos', text='Positive')
    ]

    def add_options(stream):
        for eg in stream:
            eg['options'] = choices
            yield eg

    stream = JSONL(source)
    stream = (prodigy.set_hashes(eg, input_keys=('title', 'text'))
              for eg in stream)
    stream = filter_duplicates(stream, by_input=True, by_task=False)
    stream = add_options(stream)
    stream = list(stream)

    config = {
        'blocks': [{
            'view_id': 'html',
            'html_template': '<h3>{{title}}</h3>'
        }, {
            'view_id': 'choice',
        }],
        'instructions': './docs/instructions/instructions.html',
        'choice_style': 'multiple',
        'choice_auto_accept': False,
        'feed_overlap': False,
        'port': 8023
    }
    return {
        'dataset': dataset,
        'exclude': [dataset],
        'stream': stream,
        'view_id': 'blocks',
        'config': config
    }

This is our prodigy.json config:

{
  "theme": "basic",
  "custom_theme": {
    "cardMaxWidth": 1920
  },
  "batch_size": 5,
  "history_size": 10,
  "host": "0.0.0.0",
  "cors": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "details": "omitted"
    }
  },
  "keymap": {
    "accept": ["space"],
    "save": ["command+space"],
    "ignore": ["i"]
  },
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": true,
  "feed_overlap": false,
  "auto_count_stream": true,
  "ui_lang": "en",
  "project_info": [
    "dataset",
    "session",
    "lang",
    "recipe_name",
    "view_id",
    "label"
  ],
  "show_stats": true,
  "hide_meta": false,
  "show_flag": false,
  "swipe": true,
  "swipe_gestures": { "right": "accept", "left": "reject" },
  "split_sents_threshold": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "input"
}

Thanks,
David

ines · January 28, 2022, 10:27am

@leetdavid Thanks for the report, I've merged the post onto a previous thread since it seems to be related to the same problem/questions.

One thing to look out for is whether the work stealing timeout might be the culprit and whether your annotation workflow hits scenarios where previously unanswered questions are added back to the queue too early:

Laura · February 1, 2022, 2:42pm

Thank you, Ines. In that case, is there a way to increase the timeout? Our logs show the work-stealing mechanism kicking in too often, even though we are already saving as often as possible. Besides, the log message FEED: re-adding open tasks to stream appears even when only one annotator is working on the session.

jsnfly · February 8, 2022, 10:06am

We have the same problem of FEED: re-adding open tasks to stream appearing in the logs and having duplicated examples although only one annotator is working.

leetdavid · February 8, 2022, 10:06am

It definitely looks like "work stealing" is happening. Will the option to configure the "work stealing" threshold be available in the next version?

liorm · February 13, 2022, 4:01pm

Hey, I'm experiencing the exact same issue. One user is annotating and getting duplicate documents, in the same order also. In the logs I do see FEED: re-adding open tasks to stream.

We are wasting lots of time because of this problem. With a single annotator, they notice the duplicates immediately (and even then, there is nothing I can do to fix it), but when multiple people are annotating, it takes time to realize they are annotating documents that colleagues have already annotated.

I'm not sure what next steps you plan for solving this issue. Is there a release date for a fix? Is there a workaround we can use in the meantime?

Again, we are losing hours of annotations because of this issue, and it also leads to huge frustration among the annotators.

Thank you in advance @ines

kab · February 14, 2022, 6:25pm

Hi all, I'm looking into this. I'm unable to easily reproduce the errors you all are seeing using version 1.11.7. Can you all confirm you're using 1.11.7? The logging of FEED: re-adding open tasks to stream is misleading here; I believe there's a deeper issue (but we will clean up the logging in the next release so debugging is easier).

jsnfly · February 15, 2022, 8:03am

For us it is 1.11.5, but maybe someone else can confirm for 1.11.7?

kab · February 15, 2022, 6:40pm

So for context, we've seen some issues with duplicate inputs showing up for annotators in previous versions (including 1.11.5 and some edge cases in 1.11.6). So if you can try again with version 1.11.7 and see if your issue goes away that would be great! If you are still seeing issues in 1.11.7 that definitely points to a deeper issue.

Verdiana · February 16, 2022, 11:05am

Hello,
I created a custom recipe to allow multi-user annotation, where every audio file is presented to every annotator (so with feed_overlap: True). I am using Prodigy 1.11.7 and I ran into the same duplicates issue.
Here is what I tried so far:

allow multi sessions, feed_overlap set to True;
allow multi sessions, feed_overlap set to False;
allow only one session, feed_overlap set to True;
allow only one session, feed_overlap set to False.

I can confirm that I saw duplicated audio files no matter what I tried.

Verdiana · February 16, 2022, 3:34pm

Hello,
I created a custom recipe to allow multi-user annotation, where every audio file is presented to every annotator (so with feed_overlap: True). I am using Prodigy 1.11.7 and I ran into the same duplicates issue.

After trying a lot of things, it seems that setting batch_size: 1 and instant_submit: True fixed the issue.
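
For anyone wanting to try the same workaround, here is a minimal sketch of applying the two settings to a prodigy.json-style dict. The setting names batch_size and instant_submit come from this thread; with_workaround and the starting values are hypothetical and only illustrate the override.

```python
import json

# The two settings reported above as a workaround.
WORKAROUND = {"batch_size": 1, "instant_submit": True}


def with_workaround(config):
    """Return a copy of a prodigy.json-style dict with the workaround applied."""
    updated = dict(config)  # leave the caller's dict untouched
    updated.update(WORKAROUND)
    return updated


# Hypothetical starting config, modelled on the settings discussed in this thread.
current = {"batch_size": 5, "instant_submit": False, "feed_overlap": True}
print(json.dumps(with_workaround(current), indent=2))
```

All other settings, such as feed_overlap, are left as they were. The same two keys could instead be set in the 'config' dict returned by a custom recipe.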

kab · February 17, 2022, 5:58pm

Thanks for sharing, good to know you're still seeing issues in Prodigy 1.11.7. I'm investigating this. And thanks for sharing this solution you found. I do think that's a good workaround for those having this issue while we work out a more stable solution.

Laura · March 14, 2022, 10:25am

Hi Kabir, please keep us in the loop when this issue is resolved. Thank you so much!

kab · March 14, 2022, 6:45pm

Hi all, sorry for the delay here. We're working on an experimental feature that will be available via feature flag in the next Prodigy release. It's a totally new approach to our internal Feed system and hopefully resolves these issues for good.

We're planning to get it released in the next 2-3 weeks (perhaps in an alpha release sooner than that).

Will update this thread once that version is released and folks can try it out.

kab · March 31, 2022, 6:27pm

Hi everyone,

We've been working hard on this new internal Feed system and I just wanted to update everyone waiting on a fix here. We're really close to releasing the alpha version and plan on doing that next week.

Thanks for your patience.

-Kabir

kab · April 7, 2022, 5:39pm

Overview

Hi everyone, we are excited to have an alpha version of the new internal Feed system and Database based on SQLAlchemy up on our PyPI index.
If you've been getting duplicate examples shown to annotators, we would love for you to try out the new version and new feed and respond here with any comments/questions/issues.

Installation

The new version is 1.11.8a1 which you can install with:

pip install prodigy==1.11.8a1 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

Usage

To turn on the new Feed and new Database, add the experimental_feed setting to your prodigy.json:

// prodigy.json
{
    "experimental_feed": true
}

Prodigy currently uses Peewee as the ORM for the Database. The new Feed system stores all examples that are to be shown to annotators in the Database. Naturally this requires some schema changes, so we are taking this opportunity to switch to SQLAlchemy for our ORM. We're excited that this will enable the usage of more database systems beyond SQLite, PostgreSQL, and MySQL; however, it will change how you set the database settings in your prodigy.json.

If you have solely been using the default SQLite database (i.e. not setting database settings in your prodigy.json at all) we have a new default name for the SQLite database: prodigy_v2.0a1.db. This database will not have any of your previous datasets in it. Before we officially release the v2 Database and default to using the experimental new Feed we will add a migration recipe so all your old data can be ported to the new database seamlessly.

If you are using a MySQL or PostgreSQL database, you will need to follow SQLAlchemy's URL-style configuration. We'll pass this URL directly to the SQLAlchemy create_engine function (see "Engine Configuration" in the SQLAlchemy 1.4 documentation).

For example, if you have a PostgreSQL database set up and want to use the new Feed, you will need to make the following changes to your prodigy.json:

From

// old prodigy.json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

To

// new prodigy.json
{
  "experimental_feed": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
    }
  }
}

And that's it! Your annotation workflow shouldn't need to change; multi-annotator and feed_overlap scenarios should work the same.

If you do try it out please respond to this thread and let us know how it's working. It's still early in the development process so apologies if you encounter issues, specifically with databases besides the default SQLite db.
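
For the PostgreSQL change shown above, the old-style settings can be turned into a URL with a few lines of Python. This is only a sketch under assumptions: to_sqlalchemy_url is a hypothetical helper, not part of Prodigy, and host and port are assumed defaults matching the example URL rather than values taken from the old settings. Special characters in the user name or password are percent-encoded with quote_plus so they do not break the URL.

```python
from urllib.parse import quote_plus


def to_sqlalchemy_url(old, host="localhost", port=5432, dbname=None):
    """Build a postgresql:// URL from old-style "dbname"/"user"/"password" settings.

    host and port are assumed defaults; they are not part of the old example.
    """
    user = quote_plus(old["user"])
    password = quote_plus(old["password"])  # escapes characters such as "@" or "/"
    name = dbname or old["dbname"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Old-style settings from the example above; the new database name is passed explicitly.
old_settings = {"dbname": "prodigy", "user": "username", "password": "xxx"}
new_settings = {
    "postgresql": {"url": to_sqlalchemy_url(old_settings, dbname="prodigy_v2")}
}
print(new_settings["postgresql"]["url"])
# postgresql://username:xxx@localhost:5432/prodigy_v2
```

The resulting dict has the same shape as the "db_settings" entry in the new prodigy.json example.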
