Duplicate annotations in output

Hello,
I created a custom recipe to allow multi-user annotation, where every audio file is presented to every annotator (i.e. with feed_overlap: true). I am using Prodigy 1.11.7 and I ran into the same duplicates issue.

After trying a lot of things, it seems that setting batch_size: 1 and instant_submit: true fixed the issue.

Thanks for sharing, and good to know you're still seeing issues in Prodigy 1.11.7. I'm investigating this. Thanks also for sharing the workaround you found; I think it's a good stopgap for anyone having this issue while we work out a more stable solution.

Hi Kabir, please keep us in the loop when this issue is resolved. Thank you so much!

Hi all, sorry for the delay here. We're working on an experimental feature that will be available via feature flag in the next Prodigy release. It's a totally new approach to our internal Feed system and hopefully resolves these issues for good.

We're planning to get it released in the next 2-3 weeks (perhaps in an alpha release sooner than that).

Will update this thread once that version is released and folks can try it out.


Hi everyone,

We've been working hard on this new internal Feed system and I just wanted to update everyone waiting on a fix here. We're really close to releasing the alpha version and plan on doing that next week.

Thanks for your patience.

-Kabir


Overview

Hi everyone, we are excited to have an alpha version of the new internal Feed system, and a new Database based on SQLAlchemy, up on our PyPI index.
If you've been getting duplicate examples shown to annotators, we would love for you to try out the new version and new feed and respond here with any comments/questions/issues.

Installation

The new version is 1.11.8a1 which you can install with:

pip install prodigy==1.11.8a1 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

Usage

To turn on the new Feed and new Database, add the experimental_feed setting to your prodigy.json:

// prodigy.json
{
    "experimental_feed": true
}

Prodigy currently uses Peewee as the ORM for the Database. The new Feed system stores all examples that are to be shown to annotators in the Database. Naturally this requires some schema changes, so we are taking this opportunity to switch to SQLAlchemy as our ORM. We're excited that this will enable the use of more database systems beyond SQLite, PostgreSQL, and MySQL; however, it will change how you set the database settings in your prodigy.json.

If you have solely been using the default SQLite database (i.e. not setting database settings in your prodigy.json at all), we have a new default name for the SQLite database: prodigy_v2.0a1.db. This database will not have any of your previous datasets in it. Before we officially release the v2 Database and default to the experimental new Feed, we will add a migration recipe so all your old data can be ported to the new database seamlessly.

If you are using a MySQL or PostgreSQL database, you will need to follow the URL-style configuration of SQLAlchemy. We'll pass this URL directly to the SQLAlchemy create_engine function (see Engine Configuration — SQLAlchemy 1.4 Documentation).
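As a quick illustration of the URL format, the same connection details can also be built programmatically with SQLAlchemy's URL helper. This is just a sketch using the placeholder credentials from the example config below, not anything Prodigy-specific:

```python
from sqlalchemy.engine import URL

# Build the connection URL programmatically; all credentials here
# are placeholders matching the example prodigy.json below.
url = URL.create(
    drivername="postgresql",
    username="username",
    password="xxx",
    host="localhost",
    port=5432,
    database="prodigy_v2",
)

# Render the full URL, including the password, as it would appear
# in the "url" field of the db_settings.
print(url.render_as_string(hide_password=False))
# postgresql://username:xxx@localhost:5432/prodigy_v2
```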

For example, if you have a PostgreSQL database set up and want to use the new Feed, you will need to make the following changes to your prodigy.json.

From

// old prodigy.json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

To

// new prodigy.json
{
  "experimental_feed": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
    }
  }
}

And that's it! Your annotation workflow shouldn't need to change; multi-annotator and feed_overlap scenarios should work the same.

If you do try it out, please respond to this thread and let us know how it's working. It's still early in the development process, so apologies if you encounter issues, particularly with databases other than the default SQLite db.

Thanks for the alpha version. I am using Prodigy with Docker and connecting to Azure Database for MySQL. I have made the changes to prodigy.json as outlined, created a new Prodigy database, and get the following error:

sqlalchemy.exc.CompileError: (in table 'dataset', column 'name'): VARCHAR requires a length on dialect mysql

Now that you're using SQLAlchemy, it looks like the length of VARCHAR columns needs to be defined?
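For anyone curious, the error is reproducible with SQLAlchemy alone. A minimal sketch (table and column names are illustrative): a String column with no length compiles fine on most dialects, but the MySQL dialect rejects it at DDL-compile time, and an explicit length fixes it:

```python
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import mysql
from sqlalchemy.exc import CompileError
from sqlalchemy.schema import CreateTable

metadata = MetaData()

# String() with no length: the MySQL dialect refuses to compile the DDL,
# raising the "VARCHAR requires a length on dialect mysql" error seen above.
unbounded = Table("dataset_demo", metadata, Column("name", String()))
try:
    CreateTable(unbounded).compile(dialect=mysql.dialect())
except CompileError as err:
    print(err)

# Giving the column an explicit length produces valid MySQL DDL.
bounded = Table("dataset_demo_fixed", metadata, Column("name", String(255)))
print(CreateTable(bounded).compile(dialect=mysql.dialect()))
```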


Thanks, I just found that myself. A release is building currently and will be out shortly.

Edit: Version 1.11.8a2 is now up, fixing the MySQL issue above.

Install with:

pip install prodigy==1.11.8a2 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

The new alpha release is working quite nicely in our dev environment. Thanks for the update!


That's great, thanks! Would you mind sharing a bit about your workflow? Besides using MySQL, do you have multiple annotators? Are you using feed_overlap?

Sorry, I should have provided some more details.

We're running Prodigy as a Docker container with multiple annotators connecting through Azure Web App (with Azure MySQL as the database). We have feed_overlap set to false. We did turn off instant_submit since previous versions would still allow annotators to press Undo and return the previous text example, but the alpha version locks the decision in, which is probably how instant_submit should work. Hope that helps!


I've been having a very difficult time connecting this new version of Prodigy to an existing database.

This is the error I'm seeing now:

...
  File "/usr/local/lib/python3.10/site-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1054, "Unknown column 'dataset.updated' in 'field list'")
[SQL: SELECT dataset.id, dataset.created, dataset.updated, dataset.name, dataset.meta, dataset.session, dataset.feed 
FROM dataset 
WHERE dataset.name = %(name_1)s 
 LIMIT %(param_1)s]
[parameters: {'name_1': 'search-relevance-judgements_2022-04_SUR-520', 'param_1': 1}]

@lukeorland this seems worth diving into, but could you start a new thread on this topic? That way this thread can stay on topic for the annotation duplication, and it will also allow us to go into more depth for your specific situation. When you post the new thread, could you also include your operating system and Python version, as well as the relevant Prodigy versions?

Hello, I've tried this with no luck. Details here: Duplicated examples over sessions in NER manual

Is there another workaround for this situation?

Thank you

@lukeorland hi, the issue is that the databases are incompatible, since we changed the structure of the SQL tables. This alpha version requires using a new database.

The only thing stopping it from becoming a final version is that we're still working on a database migration script, so you can keep data from the old database architecture.

Hi @kab, do you have an ETA for the migration script? Alternatively, is there any planned backwards compatibility of the Python API with the old DB schema?

The duplicates-in-feed problem has been a significant burden this year, with around half of our annotator hours being devoted to manually filtering them. We also store hundreds of thousands of records over dozens of Postgres DBs, so we will have trouble manually migrating our data.

We're looking forward to trying the experimental_feed option, but want to be sure that our existing labeled datasets will stay safe and consumable.

Hi @alexf_a, we have recently released v1.11.8a4 of the alpha, which includes the db-migrate command (as well as all improvements that come with the main Prodigy release v1.11.8). You can download it from our PyPI index like so:

pip install prodigy==1.11.8a4 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

In order to use the db-migrate command, you need to specify 1) the legacy DB (v1) and 2) the target DB (v2) in your prodigy.json:

// prodigy.json
{
    "experimental_feed": true,
    "feed_overlap": false,
    "legacy_db": "postgresql",
    "legacy_db_settings": {
        "postgresql":{
            "dbname": "prodigy",
            "user": "username",
            "password": "xxx"
        }
    },
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
         }
     }
}

The legacy DB settings are processed with Peewee, so the configuration style is the same as in v1. The target DB, however, should use the SQLAlchemy style, which in the case of PostgreSQL is slightly different, as explained earlier in this thread. You can also specify sqlite or mysql as the target.

With that configuration in place, you should be able to run:
prodigy db-migrate
That should migrate all the datasets from the v1 legacy database to the v2 target database. Also, check prodigy db-migrate --help for more options (migrating a selected dataset, excluding selected datasets, performing a dry run, etc.).

@kab

How can we pass extra arguments other than "url"? For example, MySQL DBs on Azure need an SSL certificate. We can pass that to SQLAlchemy as:

"connect_args": {"ssl-ca": "BaltimoreCyberTrustRoot.crt.pem"}

How can we set this up in the prodigy.json file?
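For context, at the SQLAlchemy level such driver options are forwarded through create_engine's connect_args parameter. A minimal sketch using SQLite's stdlib driver (the Azure MySQL URL and SSL dict above would take its place; the exact SSL key layout depends on the MySQL driver in use):

```python
from sqlalchemy import create_engine, text

# connect_args is forwarded verbatim to the DBAPI's connect() call.
# Demonstrated here with SQLite; for Azure MySQL the dict would instead
# carry the SSL options, e.g. {"ssl": {"ca": "BaltimoreCyberTrustRoot.crt.pem"}}
# (key names vary by MySQL driver).
engine = create_engine(
    "sqlite:///:memory:",
    connect_args={"timeout": 30},  # becomes sqlite3.connect(..., timeout=30)
)

with engine.connect() as conn:
    print(conn.execute(text("select 1")).scalar())
```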

Hi @kab

We upgraded Prodigy to version 1.11.8a1 and went through upgrading our MySQL database according to the new Prodigy table structure. We followed these instructions and we still see duplicate examples in both of these scenarios:

  • Annotators see the examples that they have already labelled.
  • Annotators see examples that other annotators have already labelled.

We are using prodigy with a dockerized web app and MySQL database both in Azure.

UPDATE:
We lowered the batch size to 5 and upgraded the MySQL server compute, and these steps seem to solve the problem for now.

Hi,
Just wanted to update everyone experiencing issues with duplicates in the example stream: we have recently released version 1.11.9, which fixes the bug that was causing them. Since it's a bug fix rather than the refactor we were testing in the experimental 1.11.8ax versions, we released it as a patch on the previous official Prodigy version. This means the DB setup is the same as it was for 1.11.8. Please check the 1.11.9 release post for more details on what to expect. We really appreciate your patience while we worked on it.
