Annotated data being saved to wrong dataset (race condition when saving data)

We're encountering a disconcerting bug where data is saved to the wrong dataset. So far, we've been unable to replicate it consistently, so it would be great if we could get some pointers.

For context, we deploy Prodigy servers concurrently using a process manager (pm2) so we can label with multiple annotators / tasks on one server. This means that all the servers get started at the same time. As you know, the session ID is generated using datetime, so they clash on occasion. Usually, if session IDs clash, the recipe crashes as well, and pm2 will re-run the same recipe.
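To make the clash concrete, here's a rough sketch of what we think happens. The exact naming format is our guess, but the key point is that the session name only has second-level resolution:

```python
from datetime import datetime

def session_id(now):
    # Hypothetical reconstruction of the timestamp-based session name --
    # the exact format Prodigy uses may differ, but it's derived from
    # the current datetime with second resolution.
    return now.strftime("%Y-%m-%d_%H-%M-%S")

# Two servers started by pm2 within the same wall-clock second:
t = datetime(2019, 1, 15, 10, 30, 5)
print(session_id(t))                   # 2019-01-15_10-30-05
print(session_id(t) == session_id(t))  # True -> the session IDs clash
```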

Our hypothesis is that when the data is saved, it goes to a dataset with the same session ID and the link somehow gets mixed up. But we're really not sure.
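To illustrate what we mean: the `add_examples` snippet below creates `Example` rows and then attaches them to datasets via `self.link`. If two session datasets end up with the same name and a session is resolved *by name*, one annotator's examples could be linked to the other's dataset. This is a toy sqlite reconstruction of that scenario; the table and column names here are our assumptions, not Prodigy's actual schema:

```python
import sqlite3

# Minimal stand-in for the dataset / example / link layout implied by
# add_examples() and link(). Names are assumptions for illustration only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE example (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE link (example_id INTEGER, dataset_id INTEGER);
""")
# Two session datasets created in the same second share a name:
db.execute("INSERT INTO dataset VALUES (1, '2019-01-15_10-30-05')")
db.execute("INSERT INTO dataset VALUES (2, '2019-01-15_10-30-05')")
db.execute("INSERT INTO example VALUES (1, 'annotator A answer')")
db.execute("INSERT INTO example VALUES (2, 'annotator B answer')")
# Server B resolves its session dataset by name, matches A's row first,
# and links B's example to the wrong dataset:
db.execute("INSERT INTO link VALUES (2, 1)")
rows = db.execute("""
    SELECT example.content FROM example
    JOIN link ON link.example_id = example.id
    JOIN dataset ON dataset.id = link.dataset_id
    WHERE dataset.name = ?
""", ("2019-01-15_10-30-05",)).fetchall()
print(rows)  # [('annotator B answer',)] -- B's data under A's session name
```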

    def add_examples(self, examples, datasets=tuple()):
        """
        examples (list): The examples to add.
        datasets (list / tuple): The names of the dataset(s) to add the examples to.
        """
        # Validate up front, so a bad argument can't leave behind
        # examples that were already committed below.
        if not isinstance(datasets, (tuple, list)):
            raise ValueError('datasets must be a tuple or list type, not: {}'.format(type(datasets)))
        with self.db.atomic():
            ids = []
            for eg in examples:
                content = ujson.dumps(eg, escape_forward_slashes=False)
                eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
                                    task_hash=eg[TASK_HASH_ATTR],
                                    content=content)
                ids.append(eg.id)
        # Note: the links are created *outside* the atomic block above.
        for dataset in datasets:
            self.link(dataset, ids)
        log("DB: Added {} examples to {} datasets"
            .format(len(examples), len(datasets)))
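One workaround we've been considering is to disambiguate the auto-generated session name ourselves before starting each server. A rough sketch (the naming scheme here is made up, not anything Prodigy provides):

```python
import os
import uuid
from datetime import datetime

def unique_session_id(now=None):
    # Hypothetical workaround: extend the timestamp-based name with the
    # worker's PID and a short random suffix, so pm2 processes started
    # in the same second can never share a session name.
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return "{}_{}_{}".format(stamp, os.getpid(), uuid.uuid4().hex[:6])

a = unique_session_id()
b = unique_session_id()
print(a != b)  # True: the random suffix breaks ties within the same second
```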

We are using Prodigy 1.5.1 (not the most recent version, because it clashes with neuralcoref) and a Postgres database backend.

Thanks!


Thanks for the report and sorry about that!

What exactly is the clash here? Is there anything we can do on our end to make this work?

The latest version of Prodigy did fix several issues related to database links, so it’s possible that those were also responsible for the behaviour you’re seeing here. Is there any way you can run 1.6.1 and see if you’re able to reproduce the problem? Even if you only see it once, this would at least tell us that it’s still happening and not yet fixed.

What exactly is the clash here? Is there anything we can do on our end to make this work?

It’s this error: https://github.com/huggingface/neuralcoref/issues/120#issuecomment-448316255

But I guess we can downgrade to spacy==2.0.12 and thinc==6.10.3. We’ll report back if we see any issues with Prodigy 1.6.1.

Ugh, that msgpack update :disappointed: Did downgrading / pinning msgpack not work for you? If I remember correctly, it was really just a minor/patch version of msgpack that introduced this problem and got pulled in. (Hopefully, this kind of stuff should happen less and less in the future. We’re slowly starting to take more control over our dependencies – see srsly, for example. We’ll also be switching Prodigy over to that in one of the upcoming versions.)

Hi @ines @plusepsilon
I encountered the exact opposite of this bug. When I start a new session with a new (blank) dataset, it sometimes shows a non-zero count in the ‘TOTAL’ displayed on the left side, even though the database is fresh and does not have any tagged data.
I am using Prodigy 1.7.1

Thanks

@akshitasood63 Do you mean a new database or a new dataset? Internally, all Prodigy does here is call db.get_dataset, get the length of the returned list of examples and then progressively increment it as new examples are received.

If you check the contents of that dataset manually in Python, what do you see?

I mean a new dataset. When I checked the contents of the dataset manually, there was some data present in it. So I cleared all the datasets, and then it worked fine.
But now it is happening again, although it does not happen frequently.

EDIT: Also, I am using multiple processes within Prodigy. Do you think that might be causing interference between datasets?