Old examples are automatically added to new dataset

When I create a new dataset I somehow always have examples from an old dataset linked to it:

from prodigy.components.db import connect
db = connect('sqlite', {'name': 'industry_db.db'})
db.add_dataset('complitly_new_dataset')
len(db.get_dataset('complitly_new_dataset'))
> 7149

I can’t find a way to create a new dataset or delete those examples. Do you know what can cause this issue?

This is strange! Did you confirm that the new dataset is indeed new and doesn’t yet exist in the database?

assert 'new_dataset' not in db

If the dataset already exists, and wasn’t removed via db.drop_dataset, db.add_dataset will return the already existing dataset and will add an entry about this to the log. (Maybe this is bad default behaviour, though.) You can also run your code with the environment variable PRODIGY_LOGGING=basic to see more details on what Prodigy is doing behind the scenes.
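For example, here's a minimal sketch of that check, with a placeholder dataset name. PRODIGY_LOGGING is normally set in the shell; setting it via os.environ before the prodigy import is assumed to behave the same way:

import os
# assumption: setting this before the prodigy import has the same effect
# as exporting PRODIGY_LOGGING=basic in the shell
os.environ['PRODIGY_LOGGING'] = 'basic'

from prodigy.components.db import connect

db = connect('sqlite', {'name': 'industry_db.db'})
if 'my_new_set' in db:            # 'my_new_set' is a placeholder name
    db.drop_dataset('my_new_set')
db.add_dataset('my_new_set')
assert len(db.get_dataset('my_new_set')) == 0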

This is strange indeed.
I confirmed that the dataset doesn’t already exist, and it is still happening…
I even deleted the original dataset those examples came from, and they keep coming back. Is there a place where those examples might be saved? If so, maybe I can “clear the cache” and see if it keeps happening.


SQLite is pretty straightforward in this way, so you could simply create a new database file by using a different name instead of 'industry_db.db'. This will create a new, clean database. If you don’t have too many datasets, you can export them via db-out, and then import them to the new database via db-in.
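For reference, the export half of that db-out/db-in round trip can also be sketched in Python, using only the DB methods shown in this thread (the file names are placeholders); you'd then db-in each JSONL file into the new database:

import json
from prodigy.components.db import connect

db = connect('sqlite', {'name': 'industry_db.db'})
# write one JSONL file per dataset, roughly what db-out produces
for name in db.datasets:
    with open('{}.jsonl'.format(name), 'w', encoding='utf-8') as f:
        for eg in db.get_dataset(name):
            f.write(json.dumps(eg) + '\n')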

One possible explanation I can think of is that something went wrong when linking the annotated examples to the dataset. Internally, an example with the same hash is only stored once, and then linked to one or more datasets. So if examples weren’t unlinked correctly, and you re-add a set that existed before, it might end up still linked to those examples. (But that still wouldn’t explain why a completely new set ended up with examples.)
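To make the linking model concrete, here's a small sketch with throwaway names ('scratch.db', 'set_a' and 'set_b' are placeholders): the same hashed example added to two datasets is stored once and linked to both.

from prodigy.components.db import connect
from prodigy import set_hashes

db = connect('sqlite', {'name': 'scratch.db'})   # throwaway database
eg = set_hashes({'text': 'Shared example', 'answer': 'accept'})
db.add_dataset('set_a')
db.add_dataset('set_b')
db.add_examples([eg], ['set_a'])
db.add_examples([eg], ['set_b'])
# one stored example, linked to both datasets via its hash
assert len(db.get_dataset('set_a')) == len(db.get_dataset('set_b')) == 1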

db = connect('sqlite', {'name': 'entirely_new_db.db'})
db.datasets
> [a list of all the old datasets]

Am I doing this right? Because entirely_new_db.db has all the datasets of industry_db.db.
I'm not sure if I'm doing something wrong, or if it's just the weird issue continuing here...
Do I need to delete the prodigy.db file and create a new one?

No, you’re definitely doing this correctly. I think you might have actually discovered a bug that causes the database settings to not overwrite the defaults / pre-defined settings in the prodigy.json. So it’s not respecting your custom database name, and only using the default prodigy.db or whatever else is specified in your prodigy.json.

Can you check your Prodigy home directory (by default ~/.prodigy) and see which .db files are in there? And, as a workaround, try setting your SQLite config in the prodigy.json:

{
    "db": "sqlite",
    "db_settings": {
        "name": "industry_db.db"
    }
}

And then just call:

db = connect()
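You can then verify that the right database is used, since a fresh file should report no datasets:

from prodigy.components.db import connect

db = connect()       # no arguments: settings are read from prodigy.json
print(db.datasets)   # a brand-new database file should print an empty list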

The only db file in the prodigy home directory is prodigy.db.
Unfortunately, when I set my SQLite config in the prodigy.json, nothing changes: the problem is the same.

Okay, so this at least confirms my suspicion that the settings aren’t overwritten correctly when using connect(). Will investigate this and fix whatever caused the problem. (Sorry about that!)

But I also just tested it with a new database name in the prodigy.json, and it worked fine for me. Prodigy created a new .db file in the directory, and the db.datasets were empty. What does your prodigy.json look like? Did you use the exact config I posted above?

Edit: Forgot to add: as a workaround, you can always just rename prodigy.db to something else, and Prodigy will create a new prodigy.db. This isn’t very satisfying, but if everything else fails, it will at least keep you working.
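If you want to script that rename, here's a quick sketch (it assumes the default home directory; adjust the path if you've set PRODIGY_HOME):

from pathlib import Path

# assumes the default Prodigy home directory
home = Path.home() / '.prodigy'
(home / 'prodigy.db').rename(home / 'prodigy_backup.db')
# the next connect() will create a fresh prodigy.db in its place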

Just wanted to update that the workaround worked for me, and in the new DB the datasets are not overwritten. I still can't create a new .db file via the prodigy.json, though.

Thanks a lot for your help!

I’m seeing this behavior too. Unfortunately, with our workflow we were hoping to be able to reuse dataset names, so switching to a new database file isn’t a great option for us. Here’s a simple set of commands that demonstrates the issue.

>>> from prodigy.components.db import connect
>>> from prodigy import set_hashes
>>> examples = [{'text': 'Example1', 'label': 'Nice', 'answer': 'accept'},
...             {'text': 'Example2', 'label': 'Nice', 'answer': 'reject'}]
>>> examples = [set_hashes(eg) for eg in examples]
>>> db = connect()
23:32:52 - DB: Initialising database SQLite
23:32:52 - DB: Connecting to database SQLite
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:33:03 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df2080>
>>> db.add_examples(examples, ['cmgtest'])
23:33:10 - DB: Getting dataset 'cmgtest'
23:33:10 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:33:16 - DB: Loading dataset 'cmgtest' (2 examples)
2
>>> db.drop_dataset('cmgtest')
23:33:30 - DB: Removed dataset 'cmgtest'
True
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:33:43 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df20b8>
>>> print(len(db.get_dataset('cmgtest')))
23:33:51 - DB: Loading dataset 'cmgtest' (1 examples)
1
>>> print(db.get_dataset('cmgtest'))
23:34:00 - DB: Loading dataset 'cmgtest' (1 examples)
[{'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}]
>>> db.add_examples(examples, ['cmgtest'])
23:34:18 - DB: Getting dataset 'cmgtest'
23:34:18 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:34:29 - DB: Loading dataset 'cmgtest' (3 examples)
3
>>> print(db.get_dataset('cmgtest'))
23:34:33 - DB: Loading dataset 'cmgtest' (3 examples)
[{'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}, {'label': 'Nice', '_input_hash': -544789127, 'answer': 'accept', '_task_hash': 1326324553, 'text': 'Example1'}, {'label': 'Nice', '_input_hash': 1582969015, 'answer': 'reject', '_task_hash': 19451014, 'text': 'Example2'}]
>>> db.drop_dataset('cmgtest')
23:34:37 - DB: Removed dataset 'cmgtest'
True
>>> assert 'cmgtest' not in db
>>> db.add_dataset('cmgtest')
23:34:47 - DB: Creating dataset 'cmgtest'
<prodigy.components.db.Dataset object at 0x110df2a20>
>>> print(len(db.get_dataset('cmgtest')))
23:34:51 - DB: Loading dataset 'cmgtest' (3 examples)
3
>>> db.add_examples(examples, ['cmgtest'])
23:34:58 - DB: Getting dataset 'cmgtest'
23:34:58 - DB: Added 2 examples to 1 datasets
>>> print(len(db.get_dataset('cmgtest')))
23:35:03 - DB: Loading dataset 'cmgtest' (5 examples)
5

Is there any way to wipe out the examples too? The examples only belong to that dataset, and when I drop_dataset() I’m happy to remove all the associated examples.

Thanks a lot for the detailed report and example and sorry about the frustration!

I’m pretty sure we fixed a bug for Prodigy v1.4.1 that was related to the hashing and ended up causing this issue. We’re just testing the new version and will release the update asap!

Excellent. Thanks for the quick response.


Just released v1.4.1, which includes a fix to the hashes that ensures the database always gets the correct hashes for a given dataset. This should hopefully resolve this issue as well!

I’m still seeing this behavior with v1.4.2. If I drop a dataset and then create a new dataset with the same name, all of the examples that were in that dataset reappear.

@andy Thanks for the heads up. Will definitely investigate.

@ines
Hi,
I also faced a similar issue recently. Prodigy creates a new dataset and then loads some examples that should not be in the dataset. Moreover, these examples are not even present in the stream.
Here is a demonstration of what is happening: