Old examples are automatically added to new dataset

When I create a new dataset I somehow always have examples from an old dataset linked to it:

from prodigy.components.db import connect
db = connect('sqlite', {'name': 'industry_db.db'})
db.add_dataset('complitly_new_dataset')
len(db.get_dataset('complitly_new_dataset'))
> 7149

I can’t find a way to create a new dataset or delete those examples. Do you know what can cause this issue?

This is strange! Did you confirm that the new dataset is indeed new and doesn’t yet exist in the database?

assert 'new_dataset' not in db

If the dataset already exists, and wasn’t removed via db.drop_dataset, db.add_dataset will return the already existing dataset and will add an entry about this to the log. (Maybe this is bad default behaviour, though.) You can also run your code with the environment variable PRODIGY_LOGGING=basic to see more details on what Prodigy is doing behind the scenes.
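For example, here's a minimal sketch of that check, with a placeholder dataset name. PRODIGY_LOGGING is normally set in the shell; setting it via os.environ before the prodigy import is assumed to behave the same way:

import os
# assumption: setting this before the prodigy import has the same effect
# as exporting PRODIGY_LOGGING=basic in the shell
os.environ['PRODIGY_LOGGING'] = 'basic'

from prodigy.components.db import connect

db = connect('sqlite', {'name': 'industry_db.db'})
if 'my_new_set' in db:            # 'my_new_set' is a placeholder name
    db.drop_dataset('my_new_set')
db.add_dataset('my_new_set')
assert len(db.get_dataset('my_new_set')) == 0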

This is strange indeed.
I confirmed that the dataset doesn’t already exist, and it is still happening…
I even deleted the original dataset those examples came from, and they keep coming back. Is there a place where those examples might be saved? If so, maybe I can “clear the cache” and see if it keeps happening.


SQLite is pretty straightforward in this way, so you could simply create a new database file by using a different name instead of 'industry_db.db'. This will create a new, clean database. If you don’t have too many datasets, you can export them via db-out, and then import them to the new database via db-in.
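For reference, the export half of that db-out/db-in round trip can also be sketched in Python, using only the DB methods shown in this thread (the file names are placeholders); you'd then db-in each JSONL file into the new database:

import json
from prodigy.components.db import connect

db = connect('sqlite', {'name': 'industry_db.db'})
# write one JSONL file per dataset, roughly what db-out produces
for name in db.datasets:
    with open('{}.jsonl'.format(name), 'w', encoding='utf-8') as f:
        for eg in db.get_dataset(name):
            f.write(json.dumps(eg) + '\n')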

One possible explanation I can think of is that something went wrong when linking the annotated examples to the dataset. Internally, an example with the same hash is only stored once, and then linked to one or more datasets. So if examples weren’t unlinked correctly, and you re-add a set that existed before, it might end up still linked to those examples. (But that still wouldn’t explain why a completely new set ended up with examples.)
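To make the linking model concrete, here's a small sketch with throwaway names ('scratch.db', 'set_a' and 'set_b' are placeholders): the same hashed example added to two datasets is stored once and linked to both.

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect('sqlite', {'name': 'scratch.db'})   # throwaway database
eg = set_hashes({'text': 'Shared example', 'answer': 'accept'})
db.add_dataset('set_a')
db.add_dataset('set_b')
db.add_examples([eg], ['set_a'])
db.add_examples([eg], ['set_b'])
# one stored example, linked to both datasets via its hash
assert len(db.get_dataset('set_a')) == len(db.get_dataset('set_b')) == 1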

db = connect('sqlite', {'name': 'entirely_new_db.db'})
db.datasets
> [a list of all the old datasets]

Am I doing this right? Because entirely_new_db.db has all the datasets of industry_db.db.
I'm not sure if I'm doing something wrong, or if it's just the weird issue continuing here...
Do I need to delete the prodigy.db file and create a new one?

No, you’re definitely doing this correctly. I think you might have actually discovered a bug that causes the database settings to not overwrite the defaults / pre-defined settings in the prodigy.json. So it’s not respecting your custom database name, and only using the default prodigy.db or whatever else is specified in your prodigy.json.

Can you check your Prodigy home directory (by default ~/.prodigy) and see which .db files are in there? And, as a workaround, try setting your SQLite config in the prodigy.json:

{
    "db": "sqlite",
    "db_settings": {
        "name": "industry_db.db"
    }
}

And then just call:

db = connect()
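You can then verify that the right database is used, since a fresh file should report no datasets:

from prodigy.components.db import connect

db = connect()       # no arguments: settings are read from prodigy.json
print(db.datasets)   # a brand-new database file should print an empty list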

The only db file in the prodigy home directory is prodigy.db.
Unfortunately, when I set my SQLite config in the prodigy.json, nothing changes: the problem is the same.

Okay, so this at least confirms my suspicion that the settings aren’t overwritten correctly when using connect(). Will investigate this and fix whatever caused the problem. (Sorry about that!)

But I also just tested it with a new database name in the prodigy.json, and it worked fine for me. Prodigy created a new .db file in the directory, and the db.datasets were empty. What does your prodigy.json look like? Did you use the exact config I posted above?

Edit: Forgot to add: as a workaround, you can always just rename prodigy.db to something else, and Prodigy will create a new prodigy.db. This isn’t very satisfying, but if everything else fails, it will at least keep you working.
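If you want to script that rename, here's a quick sketch (it assumes the default home directory; adjust the path if you've set PRODIGY_HOME):

from pathlib import Path

# assumes the default Prodigy home directory
home = Path.home() / '.prodigy'
(home / 'prodigy.db').rename(home / 'prodigy_backup.db')
# the next connect() will create a fresh prodigy.db in its place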

Just wanted to update that the workaround worked for me, and in the new DB the datasets are not overwritten. I still can't create a new .db file via the prodigy.json, though.

Thanks a lot for your help!

I’m seeing this behavior too. Unfortunately, with our workflow we were hoping to be able to reuse dataset names, so switching to a new database file isn’t a great option for us. Here’s a simple set of commands that demonstrates the issue.

>>> from prodigy.components.db import connect
>>> from prodigy import set_hashes
>>> examples = [{'text': 'Example1', 'label': 'Nice', 'answer': 'accept'},
...             {'text': 'Example2', 'label': 'Nice', 'answer': 'reject'}]
>>> examples = [set_hashes(eg) for eg in examples]
>>> db = connect()
23:32:52 - DB: Initialising database SQLite
23:32:52 - DB: Connecting to database SQLite
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:33:03 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df2080>
>>> db.add_examples(examples, ['cmgtest'])
23:33:10 - DB: Getting dataset 'cmgtest'
23:33:10 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:33:16 - DB: Loading dataset 'cmgtest' (2 examples)
2
>>> db.drop_dataset('cmgtest')
23:33:30 - DB: Removed dataset 'cmgtest'
True
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:33:43 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df20b8>
>>> print(len(db.get_dataset('cmgtest')))
23:33:51 - DB: Loading dataset 'cmgtest' (1 examples)
1
>>> print(db.get_dataset('cmgtest'))
23:34:00 - DB: Loading dataset 'cmgtest' (1 examples)
[{'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}]
>>> db.add_examples(examples, ['cmgtest'])
23:34:18 - DB: Getting dataset 'cmgtest'
23:34:18 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:34:29 - DB: Loading dataset 'cmgtest' (3 examples)
3
>>> print(db.get_dataset('cmgtest'))
23:34:33 - DB: Loading dataset 'cmgtest' (3 examples)
[{'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}, {'label': 'Nice', '_input_hash': -544789127, 'answer': 'accept', '_task_hash': 1326324553, 'text': 'Example1'}, {'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}]
>>> db.drop_dataset('cmgtest')
23:34:37 - DB: Removed dataset 'cmgtest'
True
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:34:47 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df2a20>
>>> print(len(db.get_dataset('cmgtest')))
23:34:51 - DB: Loading dataset 'cmgtest' (3 examples)
3
>>> db.add_examples(examples, ['cmgtest'])
23:34:58 - DB: Getting dataset 'cmgtest'
23:34:58 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:35:03 - DB: Loading dataset 'cmgtest' (5 examples)
5

Is there any way to wipe out the examples too? The examples only belong to that dataset, and when I drop_dataset() I’m happy to remove all the associated examples.

Thanks a lot for the detailed report and example and sorry about the frustration!

I’m pretty sure we fixed a bug for Prodigy v1.4.1 that was related to the hashing and ended up causing this issue. We’re just testing the new version and will release the update asap!

Excellent. Thanks for the quick response.


Just released v1.4.1, which includes a fix to the hashes that ensures the database always gets the correct hashes for a given dataset. This should hopefully resolve this issue as well!

I’m still seeing this behavior with v1.4.2. If I drop a dataset and then create a new dataset with the same name, all of the examples that were in that dataset reappear.

@andy Thanks for the heads up. Will definitely investigate.

@ines
Hi,
I also faced a similar issue recently. Prodigy creates a new dataset and then loads some examples that should not be in the dataset. Moreover, these examples are not even present in the stream.
Here is a demonstration of what is happening: