Duplicate annotations in output

Overview

Hi everyone, we are excited to have an alpha version of the new internal Feed system and Database based on SQLAlchemy up on our PyPI index.
If you've been getting duplicate examples shown to annotators, we would love for you to try out the new version and new feed and respond here with any comments/questions/issues.

Installation

The new version is 1.11.8a1 which you can install with:

pip install prodigy==1.11.8a1 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

Usage

To turn on the new Feed and new Database, add the experimental_feed setting to your prodigy.json:

// prodigy.json
{
    "experimental_feed": true
}
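If you would rather flip the setting from a script than edit the file by hand, here is a minimal sketch. The helper name enable_experimental_feed is hypothetical and not part of Prodigy, and it assumes your prodigy.json is plain JSON without comments. The demo runs against a throwaway file, so point it at your real prodigy.json yourself.

```python
import json
import tempfile
from pathlib import Path


def enable_experimental_feed(config_path):
    """Set "experimental_feed": true in a prodigy.json file (hypothetical helper).

    Existing settings are kept; the file is created if it does not exist.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["experimental_feed"] = True
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo against a temporary file rather than a real Prodigy config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "prodigy.json"
    demo_path.write_text('{"db": "sqlite"}', encoding="utf-8")
    updated = enable_experimental_feed(demo_path)
    print(updated)
```

Removing the key again, or setting it to false, should switch you back to the current feed.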

Prodigy currently uses Peewee as the ORM for the Database. The new Feed system stores all examples that are to be shown to annotators in the Database. Naturally, this requires some schema changes, so we are taking this opportunity to switch to SQLAlchemy for our ORM. We're excited that this will enable the use of more database systems beyond SQLite, PostgreSQL, and MySQL. However, it will change how you configure the database settings in your prodigy.json.

If you have only been using the default SQLite database (i.e. you have no database settings in your prodigy.json at all), the SQLite database has a new default name: prodigy_v2.0a1.db. This database will not have any of your previous datasets in it. Before we officially release the v2 Database and make the experimental new Feed the default, we will add a migration recipe so all your old data can be ported to the new database seamlessly.

If you are using a MySQL or PostgreSQL database, you will need to follow SQLAlchemy's URL-style configuration. We'll pass this URL directly to the SQLAlchemy create_engine function. For the URL format, see the Engine Configuration page of the SQLAlchemy 1.4 documentation.

For example, if you have a PostgreSQL database set up and want to use the new Feed, you will need to make the following changes to your prodigy.json:

From

// old prodigy.json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

To

// new prodigy.json
{
  "experimental_feed": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
    }
  }
}
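If you have several configs to convert, the URL can be assembled from the old-style settings. This is only a sketch: to_sqlalchemy_url is a hypothetical helper, not something Prodigy provides, and it assumes your old settings have the dbname, user and password keys shown above, with the host and port passed in separately. User and password are percent-encoded so that characters like @ or / don't break the URL.

```python
from urllib.parse import quote


def to_sqlalchemy_url(old, host="localhost", port=5432, dbname=None):
    """Build a PostgreSQL URL from old-style settings (hypothetical helper).

    `old` is the dict found under "db_settings" -> "postgresql" in the
    old prodigy.json. Pass `dbname` to point at a different database.
    """
    # Percent-encode credentials so special characters survive URL parsing.
    user = quote(old["user"], safe="")
    password = quote(old["password"], safe="")
    name = dbname or old["dbname"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


old_settings = {"dbname": "prodigy", "user": "username", "password": "xxx"}
url = to_sqlalchemy_url(old_settings, dbname="prodigy_v2")
print(url)  # postgresql://username:xxx@localhost:5432/prodigy_v2
```

The example above points the new config at a separate prodigy_v2 database, which is why the sketch lets you override dbname.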

And that's it! Your annotation workflow shouldn't need to change. Multi-annotator and feed_overlap scenarios should work the same.

If you do try it out, please respond to this thread and let us know how it's working. It's still early in the development process, so apologies if you encounter issues, especially with databases other than the default SQLite db.