Duplicate annotations in output

brdlyrbrts · April 7, 2022, 7:05pm

Thanks for the alpha version. I am using Prodigy with Docker and connecting to Azure Database for MySQL. I have made the changes as outlined to prodigy.json, created a new prodigy database, and get the following error:

sqlalchemy.exc.CompileError: (in table 'dataset', column 'name'): VARCHAR requires a length on dialect mysql

Now that Prodigy uses SQLAlchemy, it looks like the length of VARCHAR columns needs to be defined?

kab · April 7, 2022, 7:57pm

Thanks, I just found that myself. A release is building currently and will be out shortly.

Edit: Version 1.11.8a2 is now up fixing the MySQL issue above.

Install with:

pip install prodigy==1.11.8a2 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

brdlyrbrts · April 8, 2022, 4:51pm

The new alpha release is working quite nicely in our dev environment. Thanks for the update!

kab · April 8, 2022, 8:06pm

That's great! Thanks! Would you mind sharing a bit about your workflow? Besides using MySQL, do you have multiple annotators? Are you using feed_overlap?

brdlyrbrts · April 11, 2022, 1:51pm

Sorry, I should have provided some more details.

We're running Prodigy as a Docker container with multiple annotators connecting through Azure Web App (with Azure MySQL as the database). We have feed_overlap set to false. We did turn off instant_submit, since previous versions would still allow annotators to press Undo and return to the previous text example, but the alpha version locks the decision in, which is probably how instant_submit should work. Hope that helps!

lukeorland · May 3, 2022, 8:14am

I've been having a very difficult time connecting this new version of Prodigy to an existing database.

This is the error I'm seeing now:

...
  File "/usr/local/lib/python3.10/site-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1054, "Unknown column 'dataset.updated' in 'field list'")
[SQL: SELECT dataset.id, dataset.created, dataset.updated, dataset.name, dataset.meta, dataset.session, dataset.feed 
FROM dataset 
WHERE dataset.name = %(name_1)s 
 LIMIT %(param_1)s]
[parameters: {'name_1': 'search-relevance-judgements_2022-04_SUR-520', 'param_1': 1}]

koaning · May 9, 2022, 8:47am

@lukeorland this seems worth diving into, but could you start a new thread on this topic? That way this thread can remain on topic for the annotation duplication and it will also allow us to go in more depth for your specific situation. When you post a new thread, could you also include your operating system, Python version as well as relevant Prodigy versions?

raulsperoni · May 16, 2022, 9:50pm

Hello, I've tried this with no luck. Details here: Duplicated examples over sessions in NER manual

Is there another workaround for this situation?

Thank you

kab · May 19, 2022, 10:07pm

@lukeorland Hi, the issue is that the databases are incompatible, since we changed the structure of the SQL tables. This alpha version requires using a new database.

The only thing stopping it from becoming a final version is that we're still working on a database migration script, so that you can keep the data from the old database architecture.
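One way to tell whether an existing database still has the old table structure is to compare the columns of the dataset table against the ones the failing query selects. The following is a minimal sketch only: it uses an in-memory SQLite table as a stand-in (on MySQL you would list the columns with SHOW COLUMNS FROM dataset instead), the expected column names are copied from the SQL in the traceback above, and the legacy table layout is a hypothetical example rather than Prodigy's actual old schema.

```python
import sqlite3

# Columns the alpha version selects from `dataset`,
# copied from the failing query in the traceback.
EXPECTED_DATASET_COLUMNS = {"id", "created", "updated", "name", "meta", "session", "feed"}


def missing_dataset_columns(conn):
    """Return the expected `dataset` columns that this database does not have."""
    rows = conn.execute("PRAGMA table_info(dataset)").fetchall()
    present = {row[1] for row in rows}  # row[1] holds the column name
    return EXPECTED_DATASET_COLUMNS - present


# Hypothetical old-style table, for illustration only.
legacy = sqlite3.connect(":memory:")
legacy.execute(
    "CREATE TABLE dataset ("
    "id INTEGER PRIMARY KEY, name TEXT, created INTEGER, meta BLOB, session INTEGER)"
)

print(sorted(missing_dataset_columns(legacy)))
```

A non-empty result means the database was not created by the alpha version, so it needs to be migrated or replaced with a fresh database.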

alexf_a · August 4, 2022, 3:31pm

Hi @kab, do you have an ETA for the migration script? Alternatively, is there any planned backwards compatibility of the Python API with the old DB schema?

The duplicates-in-feed problem has been a significant burden this year, with around half of our annotator hours being devoted to manually filtering them. We also store hundreds of thousands of records across dozens of PostgreSQL databases, so we will have trouble manually migrating our data.

We're looking forward to trying the experimental_feed option, but want to be sure that our existing labeled datasets will stay safe and consumable.

magdaaniol · August 12, 2022, 9:37am

Hi @alexf_a, we have recently released v1.11.8a4 of the alpha, which includes the db-migrate command (as well as all the improvements that come with the main Prodigy release v1.11.8). You can download it from our PyPI index like so:

pip install prodigy==1.11.8a4 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

In order to use the db-migrate command, you need to specify 1) the legacy DB (v1) and 2) the target DB (v2) in your prodigy.json:

{
    // prodigy.json
    "experimental_feed": true,
    "feed_overlap": false,
    "legacy_db": "postgresql",
    "legacy_db_settings": {
        "postgresql": {
            "dbname": "prodigy",
            "user": "username",
            "password": "xxx"
        }
    },
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
        }
    }
}

The legacy DB settings are processed with peewee, so the configuration style is the same as in v1. The target DB, however, should use the SQLAlchemy style, which in the case of PostgreSQL is slightly different, as explained here. You can also specify sqlite or mysql as the target.

With that configuration in place, you should be able to run:

prodigy db-migrate

That should migrate all the datasets from the v1 legacy database to the v2 target database. Also, check prodigy db-migrate --help for more options (migrating a selected dataset, excluding selected datasets, performing a dry run, etc.).
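If you already have peewee-style settings and need the SQLAlchemy-style URL for the target DB, a small helper along these lines may save some typing. This is only a sketch: the function name to_sqlalchemy_url is made up for illustration, and the host and port defaults are assumptions taken from the example URL in the config above. The user and password are percent-encoded so that characters such as @ or / don't break the URL.

```python
from urllib.parse import quote_plus


def to_sqlalchemy_url(settings, host="localhost", port=5432):
    """Build a postgresql:// URL from peewee-style settings (hypothetical helper).

    `host` and `port` are assumed defaults; adjust them for your server.
    """
    user = quote_plus(settings["user"])
    password = quote_plus(settings["password"])
    return f"postgresql://{user}:{password}@{host}:{port}/{settings['dbname']}"


# Same credentials as the legacy settings above, pointed at the new v2 database.
legacy_settings = {"dbname": "prodigy", "user": "username", "password": "xxx"}
target_settings = {**legacy_settings, "dbname": "prodigy_v2"}

print(to_sqlalchemy_url(target_settings))
```

The printed value is what goes into the "url" field under db_settings.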

mk7exe · October 17, 2022, 9:45pm

@kab

How can we pass extra arguments other than "url"? For example, MySQL databases on Azure need an SSL certificate. We can pass that to SQLAlchemy as

"connect_args": {"ssl-ca": "BaltimoreCyberTrustRoot.crt.pem"}

How can we set this up in the prodigy.json file?

mk7exe · October 20, 2022, 4:40pm

Hi @kab

We upgraded Prodigy to the 1.11.8a1 version and went through upgrading our MySQL database to the new Prodigy table structure. We followed these instructions and we still see duplicate examples in both of these scenarios:

Annotators see the examples that they have already labelled.
Annotators see examples that other annotators have already labelled.

We are using prodigy with a dockerized web app and MySQL database both in Azure.

UPDATE:
We lowered the batch size to 5 and upgraded the MySQL server compute and it seems these steps solve the problem for now.
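For anyone who wants to measure both scenarios rather than rely on annotator reports, a rough check over an exported dataset (for example the JSONL written by prodigy db-out) could look like the sketch below. It assumes every saved example carries the _task_hash and _session_id fields; the function name and the sample records are made up for illustration.

```python
import json
from collections import defaultdict


def find_duplicates(lines):
    """Group JSONL examples by _task_hash and report the two duplicate scenarios."""
    sessions_by_task = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        sessions_by_task[eg["_task_hash"]].append(eg.get("_session_id"))

    same_annotator = []     # one session saved the same task more than once
    across_annotators = []  # different sessions saved the same task
    for task_hash, sessions in sessions_by_task.items():
        if len(sessions) != len(set(sessions)):
            same_annotator.append(task_hash)
        if len(set(sessions)) > 1:
            across_annotators.append(task_hash)
    return same_annotator, across_annotators


# Made-up sample records standing in for an exported dataset.
sample = [
    '{"text": "a", "_task_hash": 1, "_session_id": "ds-alice"}',
    '{"text": "a", "_task_hash": 1, "_session_id": "ds-alice"}',
    '{"text": "b", "_task_hash": 2, "_session_id": "ds-alice"}',
    '{"text": "b", "_task_hash": 2, "_session_id": "ds-bob"}',
    '{"text": "c", "_task_hash": 3, "_session_id": "ds-bob"}',
]
same, across = find_duplicates(sample)
print("repeated for one annotator:", same)
print("shared across annotators:", across)
```

With feed_overlap set to false, both lists would be expected to stay empty, so any entries point at tasks that were served more than once.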

magdaaniol · January 27, 2023, 5:43pm

Hi,
Just wanted to update everyone experiencing issues with duplicates in the example stream: we have recently released version 1.11.9, which fixes the bug that was causing them. Since it's a bug fix rather than the refactor we were testing in the experimental 1.11.8ax versions, we released it as a patch on the previous official Prodigy version. This means that the DB setup is the same as it was for 1.11.8. Please check the 1.11.9 release post for more details on what to expect. We really appreciate your patience while we worked on it.
