Share datasets between Prodigy users


What are the best ways to share an annotation database between multiple data scientists?

I guess one way would be to share a common external database.
But it might raise security, connectivity and quality concerns (trial and error during recipe iterations).

Is the best option to use db-in and db-out to share the data as JSONL?
Is there a way to have multiple databases: a local one and a remote one?

Thanks a lot for your help and for Prodigy.

Hi! Having a single shared remote database is probably the most straightforward solution. It doesn't have to be in the cloud – if everyone has access to the same shared drive, you could set the PRODIGY_HOME environment variable so everyone uses the same config and writes to the same SQLite database file.
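For example, a minimal sketch of that setup, assuming a hypothetical shared mount point – PRODIGY_HOME just needs to be set before Prodigy is imported or run:

```python
import os
import pathlib

# Hypothetical shared location – in practice this would be a network
# drive that every annotator mounts at the same path.
SHARED_HOME = pathlib.Path("/mnt/shared/prodigy")

# Point Prodigy at the shared home so it reads the shared prodigy.json
# and writes to the same SQLite database file (prodigy.db by default).
os.environ["PRODIGY_HOME"] = str(SHARED_HOME)
```

You could also export the variable in a shell profile instead, so every Prodigy command picks it up automatically.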

Is your main motivation for sharing datasets to make sure that examples aren't annotated multiple times, and so you can exclude examples if they're already in the dataset?

That's definitely an option, yes. If passing around files is too messy, a more elegant solution would be to use the database API in Python and write a script that syncs your annotations.

By default, Prodigy expects to write the examples to one database. However, you could write your own custom Database class or implement your own logic to save to a second location.
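As a sketch of that second idea – the helper name and the one-column schema here are made up for illustration (Prodigy's own tables look different), but it shows the general shape of mirroring each saved batch into a second SQLite file:

```python
import json
import sqlite3


def save_to_both(examples, local_path, remote_path):
    # Hypothetical helper: write a batch of annotation dicts into two
    # SQLite files, e.g. a local scratch DB and a shared one on a
    # network drive.
    for path in (local_path, remote_path):
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS examples (content TEXT)")
        conn.executemany(
            "INSERT INTO examples (content) VALUES (?)",
            [(json.dumps(eg),) for eg in examples],
        )
        conn.commit()
        conn.close()
```

In practice you'd call something like this from wherever your annotations are saved, so both copies stay in step.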

You could even have a custom recipe that uses the update callback to send completed annotations (or just their task hashes!) to a remote database. If you're just storing the hashes, you likely won't have any data privacy issues – but you can still use them to filter examples and detect whether something is a duplicate or not.
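As a sketch of the hash-only variant: Prodigy calls a recipe's update callback with each batch of answered tasks and sets a "_task_hash" on every example it serves. The module-level set and the filter helper below are stand-ins for whatever remote store you'd actually send the hashes to:

```python
# Stand-in for a remote store of task hashes – in a real setup this
# would be a shared database or an HTTP endpoint of your own.
SEEN_HASHES = set()


def update(answers):
    # Called by Prodigy with each batch of completed annotations.
    # Here we only keep the task hashes, not the annotated content.
    for eg in answers:
        if "_task_hash" in eg:
            SEEN_HASHES.add(eg["_task_hash"])


def is_already_annotated(eg):
    # Filter helper: skip incoming tasks whose hash was already seen.
    return eg.get("_task_hash") in SEEN_HASHES
```

A custom recipe would return the update function under the "update" key of its components dict and apply the filter to its stream.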

Hi Ines !

My main motivation would be to have a common annotation base: a way for the team to share their datasets and exploit them via Prodigy.
At the same time, a personal database (from one data scientist) can contain imperfect annotations or undocumented datasets.
I was wondering if you had any advice on a flow, a format and any good practice to update and download annotations from a shared annotation base.

Thanks for the various tips on the Database API.

Yes, that makes a lot of sense and is a good way to approach datasets IMO. Maybe you could just implement a simple command/recipe that each developer can run to add their annotations to the "master database". Under the hood, that would just use db.get_dataset (personal DB) and then db.add_examples (master DB). Since it's just a Python script, you can do pretty much anything in there – for example, you could even make it send you a Slack notification like "Arnault just added 345 examples to master dataset xyz and said: 'Annotation done! Let me know what you think!'" :sweat_smile:
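A sketch of that sync script, using the get_dataset and add_examples method names from the Database API (with real Prodigy, personal_db and master_db would come from prodigy.components.db.connect with different settings; the dedup-by-hash step is an extra assumption to keep the master clean):

```python
def sync_to_master(personal_db, master_db, dataset, master_dataset):
    # Copy one annotator's dataset into the shared master dataset,
    # skipping examples whose task hash is already in the master.
    examples = personal_db.get_dataset(dataset)
    seen = {eg.get("_task_hash") for eg in master_db.get_dataset(master_dataset)}
    new_examples = [eg for eg in examples if eg.get("_task_hash") not in seen]
    master_db.add_examples(new_examples, datasets=[master_dataset])
    return len(new_examples)  # e.g. for that Slack notification
```

Each developer could wrap this in a small command-line script and run it whenever they finish an annotation session.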

If you have overlapping annotations (e.g. the same data annotated by different people, potentially with conflicts), you might also want to look at the review recipe as a way to create "merged master datasets".

You might also find this comment helpful – it's more about general strategies for annotating and developing together as a team.