Dropping dataset from code doesn't properly delete examples

There is some unexpected behavior when adding/dropping datasets and adding samples from code: it seems like Prodigy doesn’t properly remove/unlink examples.

Consider the following code:

import prodigy
from prodigy.components.db import connect
from prodigy.util import set_hashes

seed_examples = [
	{"text": "Example 1"},
	{"text": "Example 2"},
]
seed_examples = [set_hashes(eg) for eg in seed_examples]

db = connect()

# add dataset and samples
dataset = db.add_dataset("first_dataset")
db.add_examples(seed_examples, ["first_dataset"])

# there should be 2 samples in first dataset
examples = db.get_dataset("first_dataset")
print("Examples in first dataset: {}".format(len(examples)))

# drop the first dataset
db.drop_dataset("first_dataset")

# add second dataset
dataset = db.add_dataset("second_dataset")

# there should be no samples in second dataset
examples = db.get_dataset("second_dataset")
print("Examples in second dataset: {}".format(len(examples)))

# add 2 samples
db.add_examples(seed_examples, ["second_dataset"])

# there should be 2 samples in second dataset
examples = db.get_dataset("second_dataset")
print("Examples in second dataset: {}".format(len(examples)))

Running this script on a brand-new Prodigy database prints:

Examples in first dataset: 2
Examples in second dataset: 1
Examples in second dataset: 3

Running it again:

Examples in first dataset: 2
Examples in second dataset: 2
Examples in second dataset: 4

And third time:

Examples in first dataset: 4
Examples in second dataset: 4
Examples in second dataset: 6

The expected behavior is that when dropping a dataset, it should unlink/delete all the samples associated with that dataset. Or am I missing something?

When dropping a dataset with the drop recipe, Prodigy also doesn’t delete the samples and links (it only deletes the first record), but stats reports an empty dataset (I guess due to session IDs). I’ve also noticed that it can sometimes lead to an incorrect sample count (I can’t reproduce that consistently).

By design, should the drop function/recipe remove all records of that dataset and its sessions from the database? The naive expectation is that it should.

Thanks.

Thanks a lot for the report! We just released v1.4.1, which fixes a problem with the hashing that would sometimes return incorrect results when fetching all hashes for a dataset. A similar issue has been reported on this thread and I suspect the hashing problem might have been the cause of that. It’s also likely that the incorrect sample count you’ve come across had something to do with this.

Re deleting/unlinking examples: because all examples are added to both the regular dataset and the session dataset, they’re always linked to at least two sets – potentially more, if you annotate the same examples more than once across sets. The way deleting and unlinking is currently handled isn’t 100% ideal, and we’re working on a better solution.
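To see that linking in practice, you can compare the task hashes of a regular dataset with those of the session datasets (untested sketch – `"my_dataset"` is a placeholder name):

```python
def shared_hashes(hashes_a, hashes_b):
    # Task hashes present in both datasets, i.e. examples
    # that are linked to both
    return set(hashes_a) & set(hashes_b)

def inspect_links(dataset_name):
    # Import kept inside the function so the helper above
    # can be used on its own
    from prodigy.components.db import connect
    db = connect()  # uses settings from your prodigy.json
    for session in db.sessions:
        overlap = shared_hashes(db.get_task_hashes(dataset_name),
                                db.get_task_hashes(session))
        if overlap:
            print("{} examples also linked to session {}".format(len(overlap), session))
```

Calling `inspect_links("my_dataset")` then prints one line per session dataset that shares examples with it.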

Nope, after updating to v1.4.1 and running the code I posted above, it still gives the same incorrect results.

Checking the database itself with an SQL browser shows that the examples are not deleted, nor were the links removed from the links table. Each iteration adds duplicate examples with the same input and task hashes. And since adding datasets reuses the same IDs, the number of linked examples grows each time a dataset is dropped and re-added.

You can replicate it by running the code I posted.

Hi there, when I drop datasets, prodigy stats -ls still shows the sessions. And when I inspect the database directly, the tables are still present. I can purge the DB manually, but is there a way to do this from the command line? Hope this is the right thread for this. Thanks much.

@Superscope Internally, sessions are also datasets and the individual examples you’re annotating are linked to both a regular dataset and a session dataset (and potentially other datasets). If I remember correctly, Prodigy isn’t currently purging empty sessions automatically – but you should be able to delete them just like regular datasets, by using the session name (timestamp).

Alternatively, you can also write a custom Python script that retrieves and checks the sessions and deletes them (e.g. if they’re empty or were created before a certain date etc.). Here’s an example (untested, but should work):

from prodigy.components.db import connect

db = connect()  # uses settings from your prodigy.json
for session in db.sessions:
    # Remove session if it's empty – or implement your own
    # custom logic here
    examples = db.get_dataset(session)
    if not examples:
        db.drop_dataset(session)

For more details on the Python API, check out the Database docs in your PRODIGY_README.html.

understood…thanks

When I drop a dataset using prodigy drop, the examples, links and sessions aren't deleted from the database (MySQL). Or are those the 'empty sessions' you mentioned @ines?

For example:
1 - I create a new and empty database.
2 - prodigy dataset test
3 - Then I annotate one example for 'test'
4 - prodigy drop test
5 - The result:

DB > dataset

{
	"data":
	[
		{
			"id": 2,
			"name": "2019-10-23_12-07-14",
			"created": 1571825234,
			"meta": "BLOB",
			"session": 1
		}
	]
}

DB > example

{
	"data":
	[
		{
			"id": 1,
			"input_hash": 969126593,
			"task_hash": -946968246,
			"content": "BLOB"
		}
	]
}

DB > link

{
	"data":
	[
		{
			"id": 2,
			"example_id": 1,
			"dataset_id": 2
		}
	]
}

Hello, I have a similar issue:

  1. Create dataset1 from a json file
  2. Annotate a few examples
  3. prodigy drop dataset1
  4. Create dataset1 again using the same command as in 1., expecting that annotation will start from the beginning – but the examples annotated in 2. don't show up for annotation
  5. If I create dataset2 (any different name) from the same json file, the behaviour seems to be as expected, i.e. annotation starts from scratch. However, I'd like to use the dataset1 name (to follow an existing naming convention).

Am I missing something or indeed this has not been resolved yet?

@geniki Thanks for the report, that's strange! Are you using the latest version of Prodigy? And are you using the default SQLite database setup? Also, when you run prodigy stats dataset1 after recreating it, does it show the dataset as empty?

I'm using Prodigy 1.9.9 on Windows with the default database setup and a named session (same name every time).

When I run prodigy stats dataset1 after re-creating, it shows 0 annotations and the right timestamp (i.e. most recent) for dataset creation.

The recipe is custom and involves re-setting the hashes (possibly incorrectly – could this be a reason?), and the input json file is the result of db-out after a first-pass annotation through the data using ner.correct.

@geniki Thanks for the updates! As a quick sanity check, could you run the following after dropping dataset1:

from prodigy.components.db import connect

db = connect()
print(len(db.get_task_hashes("dataset1")))

If this reports anything other than 0, it means that for some (very confusing) reason, Prodigy still thinks there are hashes associated with that dataset, which would cause it to skip examples with those hashes.

In that case, you could temporarily set "auto_exclude_current": false in your config until you've re-annotated all the examples it skipped. This basically disables skipping examples present in the current dataset, so Prodigy will send everything out again, even if it thinks it already has the hash in the data.
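For example, in your prodigy.json:

```json
{
    "auto_exclude_current": false
}
```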

If the above reports 0, it would mean that Prodigy's exclude logic knows there are no existing task hashes, and then it's more likely that there's something in the recipe that causes the examples to be skipped.

Tried the above with no success.

Looking at the verbose Prodigy logs, it seems that named sessions create their own separate "datasets" in the DB. prodigy drop dataset1 doesn't delete dataset1-user1, and that's what Prodigy picks up when I "re-create" the dataset. So to actually delete a dataset, I need to 1) look up all associated "datasets" and 2) manually drop them one by one. Is this the general and desired behaviour?
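Those two steps could be scripted, since the named-session "datasets" share the base dataset's name as a prefix (untested sketch – `"dataset1"` and the `-user1` naming scheme are from my setup):

```python
def session_datasets_for(base_name, all_names):
    # Named sessions are stored as separate datasets like "dataset1-user1"
    return [n for n in all_names if n.startswith(base_name + "-")]

def drop_with_sessions(base_name):
    # Import kept inside the function so the helper above
    # can be used on its own
    from prodigy.components.db import connect
    db = connect()  # uses settings from your prodigy.json
    for name in [base_name] + session_datasets_for(base_name, db.datasets):
        db.drop_dataset(name)
```

Then `drop_with_sessions("dataset1")` would drop the dataset along with its named-session datasets in one go.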

Update: