Annotated data being saved to wrong dataset (race condition when saving data)

plusepsilon · January 9, 2019, 8:04pm

We're encountering a disconcerting bug where the data is saved to the wrong dataset. So far, we've been unable to replicate consistently so it would be great if we can get some pointers.

For context, we deploy Prodigy servers concurrently using a process manager (pm2) so we can label with multiple annotators / tasks on one server. This means that all the servers get started at the same time. As you know, the session ID is generated using datetime, so they clash on occasion. Usually, if session IDs clash, the recipe crashes as well, and pm2 will re-run the same recipe.

Our hypothesis, is that when saving the data, it saves to datasets with the same session ID and somehow the link gets messed up. But really not sure.

    def add_examples(self, examples, datasets=tuple()):
        """
        examples (list): The examples to add.
        datasets (list): The names of the dataset(s) to add the examples to.
        """
        with self.db.atomic():
            ids = []
            for eg in examples:
                content = ujson.dumps(eg, escape_forward_slashes=False)
                eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
                                    task_hash=eg[TASK_HASH_ATTR],
                                    content=content)
                ids.append(eg.id)
        if type(datasets) is not tuple and type(datasets) is not list:
            raise ValueError('datasets must be a tuple or list type, not: {}'.format(type(datasets)))
        for dataset in datasets:
            self.link(dataset, ids)
        log("DB: Added {} examples to {} datasets"
            .format(len(examples), len(datasets)))

We are using Prodigy 1.5.1 (not using most recent one because it clashes with neuralcoref) and Postgres database backend.

Thanks!

ines · January 9, 2019, 8:41pm

Thanks for the report and sorry about that!

What exactly is the clash here? Is there anything we can do on our end to make this work?

The lastest version of Prodigy did fix several issues related to database links, so it's possible that those were also responsible for the behaviour you're seeing here. Is there any way you can run 1.6.1 and see if you're able to reproduce the problem? Even if you only see it once, this would at least tell us that it's still happening and not yet fixed.

plusepsilon · January 9, 2019, 8:56pm

What exactly is the clash here? Is there anything we can do on our end to make this work?

It's this error: Load error: ValueError: 684830 exceeds max_map_len(32768) · Issue #120 · huggingface/neuralcoref · GitHub

But I guess we can downgrade to spacy==2.0.12 and thinc==6.10.3. We'll report back if we see any issues with Prodigy 1.6.1.

ines · January 9, 2019, 9:00pm

Ugh, that msgpack update Did downgrading / pinning msgpack not work for you? If I remember correctly, it was really just a minor/patch version of msgpack that introduced this problem and got pulled in. (Hopefully, this kind of stuff should happen less and less in the future. We’re slowly starting to take more control over our dependencies – see srsly for example We’ll also be switchin Prodigy over to that in one of the upcoming versions.).

akshitasood63 · May 16, 2019, 1:10pm

Hi @ines @plusepsilon
I encountered an exact opposite of this bug. When I start a new session with a new dataset(blank), sometimes it shows some data in the ‘TOTAL’ displayed on the left side, although the database is fresh and does not have any tagged data.
I am using Prodigy 1.7.1

Thanks

ines · May 16, 2019, 1:14pm

@akshitasood63 Do you mean a new database or a new dataset? Internally, all Prodigy does here is call db.get_dataset, get the length of the returned list of examples and then progressively increment it as new examples are received.

If you check the contents of that dataset manually in Python, what do you see?

akshitasood63 · May 16, 2019, 1:19pm

I mean new dataset. When I checked the contents of the dataset manually, there was some data present in it. So, I cleared all the datasets and then it was working fine.
But now it is happening again although this does not happen frequently.

EDIT: Also, I am using multiple processes within prodigy. Do you think that might be leading to interference between datasets?

Topic		Replies	Views
Duplicated prodigy output in json database , solved	9	673	December 11, 2019
Old examples are automatically added to new dataset done , database	15	2042	March 25, 2019
Example repeated/duplicated within and across sessions usage , textcat , multi-user	5	475	December 20, 2022
Old dataset ame with new input file usage , database , solved	1	362	August 21, 2020
UI annotation progress does not match number of examples in database usage , database , server	8	563	December 13, 2021

Annotated data being saved to wrong dataset (race condition when saving data)

Related topics