Issue with Prodigy 1.11.8a2 with "experimental_feed"

My prodigy.json is as follows:

{
  "experimental_feed": true,
  "host": "0.0.0.0",
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "url": "postgresql://{username}:{password}@{aws_rds_address}/db_name"
    }
  },
  "split_sents": false,
  "feed_overlap": false
}

When I run prodigy stats with this config I get:

Traceback (most recent call last):
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column dataset.feed does not exist
LINE 3: WHERE dataset.session = false AND dataset.feed = false ORDER...
                                          ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 422, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/prodigy/recipes/commands.py", line 47, in stats
    "total_datasets": len(DB.datasets),
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/prodigy/components/db_v2/core.py", line 141, in datasets
    return self.get_dataset_names()
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/prodigy/components/db_v2/core.py", line 155, in get_dataset_names
    result = self._session.execute(query)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/python38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column dataset.feed does not exist
LINE 3: WHERE dataset.session = false AND dataset.feed = false ORDER...
                                          ^

[SQL: SELECT dataset.name 
FROM dataset 
WHERE dataset.session = false AND dataset.feed = false ORDER BY dataset.created]
(Background on this error at: https://sqlalche.me/e/14/f405)

Hi @nomiizz, thanks for giving the experimental_feed a try!
Is the DB you're pointing to in your prodigy.json a version 1.0 database? (You can check by running prodigy stats. You will probably have to switch experimental_feed back to false so that it doesn't throw the same error.)
If it is, it won't be compatible with the experimental_feed, as we have changed the DB structure (hence the missing column error).
We are working on a prodigy command to automate the migration of datasets from version 1.0 to 2.0, and it will be out soon. In the meantime, you should be able to run db_out with "experimental_feed": false to serialize the data from the version 1.0 database, and then run db_in with "experimental_feed": true to load it into the version 2.0 database.

Hello

With the release of v1.11.8, is the database migration script incorporated into it? If so, how should I go about it, since we have a lot of data in our postgres servers using the previous version of the database? We have had issues with duplicate examples for months now, and quite frankly our annotators are quite annoyed, so I am looking to incorporate this new release at the earliest. Thanks

Hi @nomiizz,
We have just released v1.11.8a4 of the alpha, which includes the db-migrate command (as well as all improvements that come with the main Prodigy release v1.11.8).
You can download v1.11.8a4 from our PyPi index like so:

pip install prodigy==1.11.8a4 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

In order to use the db-migrate command, you need to specify 1) the legacy DB (v1) and 2) the target DB (v2) in your prodigy.json:

// prodigy.json
{
    "experimental_feed": true,
    "feed_overlap": false,
    "legacy_db": "postgresql",
    "legacy_db_settings": {
        "postgresql": {
            "dbname": "prodigy",
            "user": "username",
            "password": "xxx"
        }
    },
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "url": "postgresql://username:xxx@localhost:5432/prodigy_v2"
        }
    }
}

The legacy DB settings are processed with peewee, so the configuration style is the same as in v1. The target DB, however, should use the sqlalchemy style, which in the case of postgresql is slightly different, as explained here. You can also specify sqlite or mysql as the target.
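To illustrate the difference, here is a small stdlib-only sketch (the helper name is mine, not Prodigy or sqlalchemy API) that turns v1-style keyword settings into the sqlalchemy URL form, escaping characters in the password that would otherwise break URL parsing:

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, dbname: str) -> str:
    """Compose a sqlalchemy-style postgresql URL from peewee-style settings.

    quote_plus escapes characters such as '@' or ':' in the credentials,
    which would otherwise be misread as URL separators.
    """
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{dbname}"


print(build_pg_url("username", "p@ss:word", "localhost", 5432, "prodigy_v2"))
# → postgresql://username:p%40ss%3Aword@localhost:5432/prodigy_v2
```

The escaping matters mainly for generated passwords; a plain alphanumeric password passes through unchanged.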

With that configuration in place, you should be able to run:

prodigy db-migrate

That should migrate all the datasets from the v1 legacy database to the v2 target database. Also, check
prodigy db-migrate --help for more options (migrating a selected dataset, excluding selected datasets, performing a dry run, etc.).

@magdaaniol thanks for the update, although I am still confused about how this would work in my case. We work with multiple customers' data, and all datasets for each customer are stored in their own database in Prodigy. What I don't understand is how the v2 database for a customer gets created. I tried adding "_v2" at the end of the customer database name in the sqlalchemy URL format; however, that gives me a "database 'customername_v2' does not exist" error with the migration recipe. Previously we had been creating customer databases through our recipes using:

from playhouse.postgres_ext import PostgresqlExtDatabase
from prodigy.components.db import Database

psql_db = PostgresqlExtDatabase(db, user="username", password=pw, host=host)
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

and then use it something like:

@prodigy.recipe('custom-recipe')
def custom_recipe():
    return {'db': db}  # etc.

Let me know how to go about this. Thanks!

Hey @nomiizz,

how does the v2 database for a customer get created?

Actually, with the experimental_feed, the postgres database should be created manually on the postgresql server before pointing Prodigy to it. There are probably ways around it, but typically you'd create your database outside your script, because postgres does not allow creating databases inside transactions, while SQLAlchemy always tries to run queries inside a transaction.
So, assuming you want to name your target db prodigy_v2, you'd run CREATE DATABASE prodigy_v2 on your postgresql server before launching Prodigy.
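If you want to script that step, one possible sketch (the helper name and the strict identifier check are mine, not Prodigy API) validates the database name and returns the statement, which you would then execute on an autocommit connection, since CREATE DATABASE cannot run inside a transaction:

```python
import re


def create_database_sql(dbname: str) -> str:
    """Return a CREATE DATABASE statement for a validated identifier.

    The name is restricted to a conservative identifier pattern to avoid
    SQL injection; database names cannot be passed as bound parameters.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", dbname):
        raise ValueError(f"unsafe database name: {dbname!r}")
    return f"CREATE DATABASE {dbname}"


print(create_database_sql("prodigy_v2"))  # → CREATE DATABASE prodigy_v2
```

With psycopg2, you would run the result on a connection with conn.autocommit = True, because psycopg2 otherwise wraps statements in a transaction.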
In your snippet psql_db = PostgresqlExtDatabase(db, user="username", password=pw, host=host), Playhouse takes care of initializing the peewee database with the name corresponding to the db variable.
Because Prodigy passes the postgres URL directly to create_engine of SQLAlchemy, the database should already exist there.
Since the migration recipe requires both the legacy and target database to be specified in your prodigy.json, each of your Prodigy users should have the prodigy.json settings mentioned in my previous message, assuming there is a database named prodigy on your legacy DB server and a database named prodigy_v2 on your target server (to use arbitrary names as examples).

Hi @magdaaniol

I followed the steps you mentioned. I went into the postgresql server and created a new DB. I was testing on one of our customers', so I named the new DB 'customername_v2', whereas the original was called 'customername'. I used CREATE DATABASE customername_v2; to create it. The prodigy.json was set exactly as you described, and I am using Prodigy 1.11.8a4. I then ran the DB migration script in dry-run mode and got the following error:

File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/recipes/commands.py", line 351, in db_migrate
    if set_id in DB and len(DB.get_dataset_examples(set_id)):
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/components/db_v2/core.py", line 125, in __contains__
    dataset = self.get_dataset_by_name(name)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/components/db_v2/core.py", line 255, in get_dataset_by_name
    return Dataset.want(self._session, name=name)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/components/db_v2/models.py", line 61, in want
    result = db.execute(query)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/Users/Nauman.Ahmed/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column dataset.updated does not exist
LINE 1: SELECT dataset.id, dataset.created, dataset.updated, dataset...
                                            ^
HINT:  Perhaps you meant to reference the column "dataset.created".

[SQL: SELECT dataset.id, dataset.created, dataset.updated, dataset.name, dataset.meta, dataset.session, dataset.feed 
FROM dataset 
WHERE dataset.name = %(name_1)s 
 LIMIT %(param_1)s]
[parameters: {'name_1': 'customername_aug18_nercorrect', 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/14/f405)

Hey @nomiizz,
It looks like, for some reason, your target DB configuration is still pointing to the Prodigy v1 database (at least that is how I managed to reproduce your error). Could you please share your prodigy.json so that we are on the same page? Also, if you run prodigy stats -ls, does it fail with psycopg2.errors.UndefinedColumn: column dataset.feed does not exist?
If that is the case, then indeed your customername_v2 has been initialized as a peewee database for some reason. (Maybe this DB was used in a prodigy run with experimental_feed set to false prior to your db-migration trial?)

I would just start fresh by deleting the current customername_v2 and creating a new, empty one.
Assuming that the newly created customername_v2 database has no relations and the following prodigy.json setting:

{
  "experimental_feed": true,
  "legacy_db": "postgresql",
  "legacy_db_settings": {
    "postgresql": {
      "dbname": "customername",
      "user": "xxx",
      "password": "yyy"
    }
  },
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "url": "postgresql://xxx:yyy@{host}:{port}/customername_v2"
    }
  }

}

could you run prodigy stats -ls and share the output, please? Thank you!

Hello @magdaaniol

The migration script did in fact work. Last time, when I created the new database, I tried to access it and list the datasets using the prodigy commands, which I think initialized it as a peewee database. As a side note, the migration script might be better off with a progress bar. During migration of a large dataset with about 7000 examples, it timed out and only 4000 examples were successfully moved.
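A batched copy with a running counter along these lines would make partial progress visible and limit the damage of a timeout (the target here is a plain stand-in dict, not Prodigy's database API):

```python
def migrate_in_batches(source_examples, target, dataset, batch_size=500):
    """Copy examples to a target store in fixed-size batches.

    The point of the sketch is the batching and the progress counter:
    each batch is committed independently, so a timeout loses at most
    one batch rather than the whole run.
    """
    total = len(source_examples)
    for start in range(0, total, batch_size):
        batch = source_examples[start:start + batch_size]
        target.setdefault(dataset, []).extend(batch)
        done = min(start + batch_size, total)
        print(f"{dataset}: migrated {done}/{total} examples")


target = {}
migrate_in_batches([{"text": f"ex{i}"} for i in range(7)], target, "demo", batch_size=3)
# prints: demo: migrated 3/7 examples, then 6/7, then 7/7
```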

However, more importantly, I am still not able to run a session with the experimental feed option. The problem is what I mentioned before: we run different sessions for each customer, and each customer has their own DB. This was achieved by doing the following:

import prodigy
from playhouse.postgres_ext import PostgresqlExtDatabase
from prodigy.components.db import Database

@prodigy.recipe('custom-recipe')
def custom_recipe(customer_db):
    # pw and host come from our own configuration
    psql_db = PostgresqlExtDatabase(customer_db, user="username", password=pw, host=host)
    db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

    return {'db': db}  # etc.

Now, with the v2 DB, I cannot use the peewee ORM extensions to create a database object with prodigy. Neither can I use functions like 'Database.get_dataset' or 'Database.get_input_hashes'. Since I don't know the internal schema of your DB, I don't know how to implement any of these methods, and more importantly, I don't know how to pass the db as an option into the prodigy.json from a custom recipe. Please help me in this regard. Thanks

Hey @nomiizz ,
Note taken on the progress bar feature and the timeout issue - thank you for the feedback!

About the initialization of the Database object in the script: in v2, the initialization of the Database (i.e. the creation of tables etc.) happens outside Database.__init__ and requires access to the Base model, which Prodigy does not expose. So, at least for now, the only option is what you're doing already, i.e.:

  1. create the database on the postgres server
  2. pass the db settings to Prodigy via prodigy.json

If your main concern is being able to interact with the Database object programmatically, you could do that as follows:

from prodigy.core import get_db  # context manager for accessing the DB

with get_db() as DB:
    DB.get_dataset_by_name(name)
    hashes = DB.get_input_hashes()

To be clear, the get_db function is not part of the official API. With the v1 Prodigy database, the "more correct" way to do that is to use the connect method of the Database class:

# only in v1 database
from prodigy.components.db import connect
DB = connect()
DB.get_dataset_by_name(name)
hashes = DB.get_input_hashes()

That, however, won't work for the v2 Prodigy database, as the connect function returns a tuple with the data to initialize the Database object, rather than the Database object itself as it used to be. So for v2, you'd have to go:

# only in v2 database
from prodigy.components.db_v2 import connect, Database
Session, db_id, db_name = connect()
session = Session() # manual session init
prodigy_db = Database(session, db_id, db_name)
prodigy_db.add_dataset("my_dataset")

That said, I recommend you try the get_db method, as manually initializing the session has not been thoroughly tested.
Once you have access to the Database object, the API is the same in v1 and v2, so all the methods documented for the Database in the docs should be available.
I'm very glad this conversation came up! Clearly, we should improve programmatic access to the Prodigy v2 Database.

@magdaaniol Thanks for the help. However, the recommendations you provided are missing an important requirement: both the get_db() and connect() functions don't seem to accept any input specifying which database to connect to. I am fine with creating the v2 database myself on the postgres server, but I want to access the database programmatically through Prodigy and pass in an argument for which database I want to connect to dynamically (not only the one that I have mentioned in the prodigy.json).

To further illustrate my use case, let's say I have 5 customer DBs (customer_1, customer_2, customer_3, customer_4, customer_5). Since I have to move to v2 databases, I create v2 databases for each on the server (customer_1_v2, customer_2_v2, customer_3_v2, customer_4_v2, customer_5_v2). I then use the migration script to move all the datasets to the v2 databases for each customer.

Now I would like to run a recipe and dynamically pass in which DB the working dataset should be stored in, depending on which customer the dataset belongs to. I have this feature in place now by passing the DB name to the recipe, accessing the database through prodigy.components.db, and returning it from the recipe. Is it possible to do something like get_db("customer_3_v2") or connect({"url": "postgresql://xxx:yyy@{host}:{port}/customer_3_v2"}), rather than only accessing the DB I have put in the prodigy.json?

Hey @nomiizz ,

Sorry if I wasn't clear enough in my last message. What I meant to say is that, sadly, it is impossible (at least for now) to programmatically initialize the Database object in v2, which is a necessary step to pass a custom DB as a recipe argument.
You can only connect (using the methods described in my previous message) to the DB that has been initialized by Prodigy on startup.
One more workaround I can think of for your use case would be to override the prodigy.json DB settings via an environment variable on the command line. So, instead of passing the URL in the recipe, you'd run the following before the prodigy command:

export PRODIGY_CONFIG_OVERRIDES='{"db": "postgresql","db_settings":{"postgresql":{"url": "postgresql://xxx:yyy@localhost:5432/customer_3_v2"}}}'
prodigy my_recipe customer_3_dataset etc.

Or even as one command:

PRODIGY_CONFIG_OVERRIDES='{"db": "postgresql","db_settings":{"postgresql":{"url": "postgresql://xxx:yyy@localhost:5432/customer_3_v2"}}}' prodigy my_recipe customer_3_dataset etc.

(Note: no semicolon between the variable assignment and the command — with a semicolon, the variable would not be passed into the prodigy process's environment.)
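To avoid hand-writing that JSON for every customer, a small sketch (the helper name is mine) that generates the override string from a target database URL:

```python
import json


def overrides_for(customer_db_url: str) -> str:
    """Build the PRODIGY_CONFIG_OVERRIDES value for a target database URL."""
    return json.dumps({
        "db": "postgresql",
        "db_settings": {"postgresql": {"url": customer_db_url}},
    })


value = overrides_for("postgresql://xxx:yyy@localhost:5432/customer_3_v2")
print(value)
# You could then set this as PRODIGY_CONFIG_OVERRIDES via os.environ
# before launching prodigy, e.g. with subprocess.
```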

@magdaaniol let me know if this follow-up issue requires a separate thread and I will be happy to open one. For now I will post it here. I have been able to migrate all of our customer datasets to the v2 database except for one, using the db-migrate command along with the PRODIGY_CONFIG_OVERRIDES environment variable. The problem database is the one I was initially testing this approach on, before programmatically iterating through all customer databases. The error message I am getting is:

(prodigyvenv3.8) User@C02C42QJMD6N obj_sim_demo_app % prodigy db_migrate -d customer1_Jul23_test

ℹ Loading legacy database: PostgreSQL_legacy and target database:
PostgreSQL

Traceback (most recent call last):
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 426, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/recipes/commands.py", line 341, in db_migrate
    meta = LEGACY_DB.get_meta(session)
  File "/Users/User/opt/anaconda3/envs/prodigyvenv3.8/lib/python3.8/site-packages/prodigy/components/db.py", line 304, in get_meta
    meta["created"] = dataset.created
TypeError: 'NoneType' object does not support item assignment

Since this error message is coming from Prodigy internally I am not able to figure out what the cause is.

Hey @nomiizz,
Glad to hear you could automate the migration with the help of the env variable! I'd suggest we continue in this thread because it will be easier for other users to understand the problem in the context of our previous conversation - thanks for asking, though!
About the error: I suspect customer1_Jul23_test is a v2 database that you are trying to read as a legacy, i.e. v1, database. I'll definitely add some checks for that to the command, but in the meantime, let's figure out which version it is. This can be done in two ways:

  1. check its schema directly in Postgres: if it contains the table feed, it's v2.
  2. run prodigy stats on it:
export PRODIGY_CONFIG_OVERRIDES='{"experimental_feed": true,"db": "postgresql","db_settings":{"postgresql":{"url": "postgresql://user:password@localhost:5432/customer1_Jul23_test"}}}'
prodigy stats

If prodigy stats prints its output correctly, you can check that Database Version is 2.0. If, however, it fails with the schema error WHERE dataset.session = false AND dataset.feed = false ORDER BY dataset.created (Background on this error at: https://sqlalche.me/e/14/f405), then it's v1 and the problem is somewhere else. But let's first confirm the version of the problematic DB.
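For option 1, the table check can be scripted. This sketch demonstrates the pattern against an in-memory SQLite database so it is self-contained; against Postgres, you would instead query information_schema.tables for a table named feed:

```python
import sqlite3


def has_table(conn: sqlite3.Connection, name: str) -> bool:
    """Return True if the given table exists (SQLite flavour of the check).

    On Postgres, the equivalent query would go against
    information_schema.tables rather than sqlite_master.
    """
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed (id INTEGER PRIMARY KEY)")
print(has_table(conn, "feed"))  # → True (a feed table suggests a v2 schema)
```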