Doubts about databases

Hi,

I have some questions about the database structure when I use the drop recipe.

In the dataset table, the entries that are not sessions are deleted, so why are the sessions still in the database? Is there a reason for this?

In the example table, no entries are deleted, so if I upload the same content again, it will be duplicated with the same input_hash and task_hash. Is that irrelevant, or is it better to remove the entries from example to keep the database clean?

Thanks, and sorry for my English :blush:

Hi! At the moment, the session datasets aren't always explicitly linked to a given dataset, so you can remove a regular dataset independently from a session dataset. If you don't want any of the session datasets, you could fetch all session dataset names, filter for the timestamp names and then remove them all. This is probably easiest to do by calling into the Database API from Python: Database · Prodigy · An annotation tool for AI, Machine Learning & NLP
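As a rough sketch of that cleanup, something like the following could work. It assumes session names follow a timestamp pattern such as `2019-03-04_12-30-11` (check this against the names in your own database), and that the database object exposes a `sessions` list and a `drop_dataset(name)` method, as described in the Database API docs. The helper names `find_session_names` and `drop_sessions` are made up for this example.

```python
import re

# Assumed format of auto-generated session names, e.g. "2019-03-04_12-30-11".
# Adjust the pattern if the names in your database look different.
SESSION_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}$")


def find_session_names(names):
    """Return only the names that look like timestamp session datasets."""
    return [name for name in names if SESSION_NAME.match(name)]


def drop_sessions(db, dry_run=True):
    """Drop all timestamp-named session datasets and return their names.

    `db` is any object with a `sessions` list and a `drop_dataset(name)`
    method. With dry_run=True nothing is deleted, so the returned list
    can be reviewed first.
    """
    targets = find_session_names(db.sessions)
    if not dry_run:
        for name in targets:
            db.drop_dataset(name)
    return targets
```

With Prodigy installed, you would typically get the database via `from prodigy.components.db import connect` and `db = connect()`, call `drop_sessions(db)` once to inspect the list, and then call it again with `dry_run=False` to actually delete.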

While one example can be linked to multiple datasets, you can also have the same example included in different datasets multiple times. There are definitely use cases where you might want this (e.g. if you're creating multiple versions of the same annotation by different people, or if you want multiple and slightly different versions of the same dataset).

I wouldn't worry too much about carefully grooming your database, especially if you're working with text. An individual example isn't that large, so your database will stay relatively small for a long time.
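If you'd still like to see how much duplication a dataset contains, here is a minimal sketch that counts repeated task hashes. It assumes each example is a dict with a `_task_hash` key, which is how Prodigy stores the hash on the task itself. The function name `count_duplicate_tasks` is made up for this example.

```python
from collections import Counter


def count_duplicate_tasks(examples):
    """Map each task hash that occurs more than once to its count.

    Examples without a "_task_hash" key are skipped.
    """
    counts = Counter(eg["_task_hash"] for eg in examples if "_task_hash" in eg)
    return {task_hash: n for task_hash, n in counts.items() if n > 1}
```

You could pass it the examples loaded from a dataset through the Database API. The exact method for loading examples depends on your Prodigy version, so check the docs for the one you have installed.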
