I was able to resolve the duplicate examples issue by using Prodigy 1.11.8a4, migrating all our datasets to the v2 DB, and updating our `prodigy.json` and recipes accordingly. This is the ticket that logs our correspondence for this change:
Now I would like to upgrade to the latest Prodigy version. I have looked at the release notes for Prodigy v1.11.9 and I am not sure what changes need to be made. It seems the experimental feed with the v2 DB option and SQLAlchemy was not carried over to subsequent releases. The `url` key in `prodigy.json` now throws an error: `peewee.ProgrammingError: invalid dsn: invalid connection option "url"`
Even though the experimental approach we were testing in 1.11.8a2 solved some of the issues related to duplicates, overall it was not an optimal solution, so we decided to abandon this path and go for a more comprehensive feed refactoring. All stable releases following the experimental 1.11.8a2 use the same database setup as 1.11.8, i.e. the v1 DB, as documented here.
This means that we're back to using peewee, which is why the `url` setting in your configuration file is no longer valid. The PostgreSQL setup pattern for the v1 DB is the following:
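For reference, a v1-style PostgreSQL block in `prodigy.json` looks roughly like this; the values under `db_settings` are placeholders for your own connection details:

```json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "your_user",
      "password": "your_password",
      "host": "localhost",
      "port": 5432
    }
  }
}
```

Note that instead of a single `url` DSN string, the peewee-based setup takes the individual connection options as separate keys.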
In order to port the data you currently store in the experimental v2 DB, you'd have to migrate it back to the v1 DB. I understand it's a bit of a nuisance, but, depending on your setup of course, you should be able to automate it with a bash script.
The steps would be as follows:
1. Make a backup copy of your current v2 DB.
2. Export all the datasets to disk. If you store the names of all datasets you want to migrate in a `datasets.txt` file (one name per line), you can use a bash one-liner like this to export them all to `my_folder` (you can get a list of all datasets with the `prodigy db stats -ls` command):

```bash
cat datasets.txt | while read line; do python -m prodigy db-out "$line" my_folder; done
```

3. Start a fresh virtual environment and install the latest Prodigy version, which will be 1.11.14.
4. Import all the datasets stored in `my_folder` into the v1 DB. Again, you can use a bash one-liner like this:

```bash
cat datasets.txt | while read line; do python -m prodigy db-in "$line" my_folder/"$line".jsonl; done
```
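The export and import loops above can also be wrapped into one reusable helper. This is just a sketch: the function name `migrate_datasets`, the `PRODIGY` override, and the blank-line handling are illustrative assumptions, not part of Prodigy itself.

```shell
# Sketch of a helper wrapping the export (db-out) and import (db-in) steps.
# Assumptions: dataset names live in a text file (one per line) and the
# exports go to a local folder. Set PRODIGY=echo to dry-run the commands
# without touching any database.
migrate_datasets() {
  local mode="$1"      # "export" (db-out) or "import" (db-in)
  local list="$2"      # file listing dataset names, one per line
  local out_dir="$3"   # folder holding the .jsonl exports
  local prodigy="${PRODIGY:-python -m prodigy}"
  mkdir -p "$out_dir"
  while IFS= read -r name; do
    [ -z "$name" ] && continue  # skip blank lines
    if [ "$mode" = "export" ]; then
      $prodigy db-out "$name" "$out_dir"
    else
      $prodigy db-in "$name" "$out_dir/$name.jsonl"
    fi
  done < "$list"
}
```

You would run `migrate_datasets export datasets.txt my_folder` in the old (v2) environment and `migrate_datasets import datasets.txt my_folder` in the fresh (v1) environment.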
Hey @magdaaniol, this will be more than a bit of a nuisance for us. I'm disappointed with the support quality tbh. It was a lot of (necessary) effort for us to get our data into v2, and being asked to move back is discouraging.
Hi @alexf_a,
I do understand the frustration. I must reiterate, though, that we released 1.11.8a4 as explicitly experimental, and we did point out that the v2 DB schema differs from the v1 schema, which is why the extra migration script was needed.
With hindsight, I think we perhaps shouldn't have named it DB v2, or should have warned users even more explicitly that the v1 and v2 DBs are incompatible.
I can only say that at that point in time we thought that was the direction we would take, but for important reasons this solution never made it into a stable release.
I'm really sorry that you've come to depend on an alpha version, and I admit that better guidance might have made you reconsider that decision, given how costly the migration process is.
If there's anything we can do to help with the data migration, please let us know.