Share datasets between Prodigy users


What are the best ways to share an annotation database between multiple data scientists?

I guess one way would be to share a common external database.
But it might raise security, connectivity and quality concerns (trial and error during recipe iterations).

Is the best option to use db-in and db-out to share the data as JSONL?
Is there a way to have multiple databases: a local one and a remote one?

Thanks a lot for your help and for Prodigy.

Hi! Having a single shared remote database is probably the most straightforward solution. It doesn't have to be in the cloud – if everyone has access to the same shared drive, you could set the PRODIGY_HOME environment variable so everyone uses the same config and writes to the same SQLite database file.
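For example, a minimal sketch of that setup, assuming a hypothetical shared mount point – PRODIGY_HOME just needs to be set before Prodigy is imported or run:

```python
import os
import pathlib

# Hypothetical shared location – in practice this would be a network
# drive that every annotator mounts at the same path.
SHARED_HOME = pathlib.Path("/mnt/shared/prodigy")

# Point Prodigy at the shared home so it reads the shared prodigy.json
# and writes to the same SQLite database file (prodigy.db by default).
os.environ["PRODIGY_HOME"] = str(SHARED_HOME)
```

You could also export the variable in a shell profile instead, so every Prodigy command picks it up automatically.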

Is your main motivation for sharing datasets to make sure that examples aren't annotated multiple times, and so you can exclude examples if they're already in the dataset?

That's definitely an option, yes. If passing around files is too messy, a more elegant solution would be to use the database API in Python and write a script that syncs your annotations.

By default, Prodigy expects to write the examples to one database. However, you could write your own custom Database class or implement your own logic to save to a second location.
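As a sketch of that second idea – the helper name and the one-column schema here are made up for illustration (Prodigy's own tables look different), but it shows the general shape of mirroring each saved batch into a second SQLite file:

```python
import json
import sqlite3


def save_to_both(examples, local_path, remote_path):
    # Hypothetical helper: write a batch of annotation dicts into two
    # SQLite files, e.g. a local scratch DB and a shared one on a
    # network drive.
    for path in (local_path, remote_path):
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS examples (content TEXT)")
        conn.executemany(
            "INSERT INTO examples (content) VALUES (?)",
            [(json.dumps(eg),) for eg in examples],
        )
        conn.commit()
        conn.close()
```

In practice you'd call something like this from wherever your annotations are saved, so both copies stay in step.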

You could even have a custom recipe that uses the update callback to send completed annotations (or just their task hashes!) to a remote database. If you're just storing the hashes, you likely won't have any data privacy issues – but you can still use them to filter examples and detect whether something is a duplicate or not.
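As a sketch of the hash-only variant: Prodigy calls a recipe's update callback with each batch of answered tasks and sets a "_task_hash" on every example it serves. The module-level set and the filter helper below are stand-ins for whatever remote store you'd actually send the hashes to:

```python
# Stand-in for a remote store of task hashes – in a real setup this
# would be a shared database or an HTTP endpoint of your own.
SEEN_HASHES = set()


def update(answers):
    # Called by Prodigy with each batch of completed annotations.
    # Here we only keep the task hashes, not the annotated content.
    for eg in answers:
        if "_task_hash" in eg:
            SEEN_HASHES.add(eg["_task_hash"])


def is_already_annotated(eg):
    # Filter helper: skip incoming tasks whose hash was already seen.
    return eg.get("_task_hash") in SEEN_HASHES
```

A custom recipe would return the update function under the "update" key of its components dict and apply the filter to its stream.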

Hi Ines !

My main motivation would be to have a common annotation base: a way for the team to share their datasets and exploit them via Prodigy.
At the same time, a personal database (from one data scientist) can contain imperfect annotations or undocumented datasets.
I was wondering if you had any advice on a flow, a format and any good practice to update and download annotations from a shared annotation base.

Thanks for the various tips on the Database API.

Yes, that makes a lot of sense and is a good way to approach datasets IMO. Maybe you could just implement a simple command/recipe that each developer can run to add their annotations to the "master database". Under the hood, that would just use db.get_dataset (personal DB) and then db.add_examples (master DB). Since it's just a Python script, you can do pretty much anything in there – for example, you could even make it send you a Slack notification like "Arnault just added 345 examples to master dataset xyz and said: 'Annotation done! Let me know what you think!'" :sweat_smile:
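A sketch of that sync script, using the get_dataset and add_examples method names from the Database API (with real Prodigy, personal_db and master_db would come from prodigy.components.db.connect with different settings; the dedup-by-hash step is an extra assumption to keep the master clean):

```python
def sync_to_master(personal_db, master_db, dataset, master_dataset):
    # Copy one annotator's dataset into the shared master dataset,
    # skipping examples whose task hash is already in the master.
    examples = personal_db.get_dataset(dataset)
    seen = {eg.get("_task_hash") for eg in master_db.get_dataset(master_dataset)}
    new_examples = [eg for eg in examples if eg.get("_task_hash") not in seen]
    master_db.add_examples(new_examples, datasets=[master_dataset])
    return len(new_examples)  # e.g. for that Slack notification
```

Each developer could wrap this in a small command-line script and run it whenever they finish an annotation session.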

If you have overlapping annotations (e.g. the same data annotated by different people, potentially with conflicts), you might also want to look at the review recipe as a way to create "merged master datasets".

You might also find this comment helpful – it's more about general strategies for annotating and developing together as a team.