Annotated data being saved to wrong dataset (race condition when saving data)

bug
database

(Motoki Wu) #1

We’re encountering a disconcerting bug where annotated data is saved to the wrong dataset. So far, we’ve been unable to reproduce it consistently, so it would be great to get some pointers.

For context, we deploy Prodigy servers concurrently using a process manager (pm2) so we can label with multiple annotators / tasks on one server. This means that all the servers get started at the same time. As you know, the session ID is generated using datetime, so they clash on occasion. Usually, if session IDs clash, the recipe crashes as well, and pm2 will re-run the same recipe.
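To illustrate the clash, here is a minimal sketch, assuming the session ID is a timestamp formatted to the second. The exact format Prodigy uses is an assumption here, and both function names are hypothetical, not part of Prodigy’s API. It only shows why servers started within the same second would end up with the same ID, and how a random suffix would avoid that.

```python
import uuid
from datetime import datetime


def timestamp_session_id(now):
    """Hypothetical stand-in for a datetime-based session ID (format assumed)."""
    return now.strftime("%Y-%m-%d_%H-%M-%S")


def unique_session_id(now):
    """Same timestamp plus a random suffix, so simultaneous starts stay distinct."""
    return "{}_{}".format(timestamp_session_id(now), uuid.uuid4().hex[:8])


# Two servers launched by the process manager within the same second.
start_a = datetime(2019, 1, 7, 9, 30, 0, 120000)
start_b = datetime(2019, 1, 7, 9, 30, 0, 870000)

# The sub-second difference is lost, so both servers get the same ID.
print(timestamp_session_id(start_a) == timestamp_session_id(start_b))

# With a random suffix, a clash becomes very unlikely.
print(unique_session_id(start_a) == unique_session_id(start_b))
```

Whether the session ID can be overridden in our Prodigy version is a separate question; the sketch only demonstrates the collision itself.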

Our hypothesis is that when saving the data, it gets saved to datasets with the same session ID and the link between examples and datasets somehow gets messed up. But we’re really not sure. For reference, this is the method that adds the examples:

    def add_examples(self, examples, datasets=tuple()):
        """
        examples (list): The examples to add.
        datasets (list): The names of the dataset(s) to add the examples to.
        """
        with self.db.atomic():
            ids = []
            for eg in examples:
                content = ujson.dumps(eg, escape_forward_slashes=False)
                eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
                                    task_hash=eg[TASK_HASH_ATTR],
                                    content=content)
                ids.append(eg.id)
        if type(datasets) is not tuple and type(datasets) is not list:
            raise ValueError('datasets must be a tuple or list type, not: {}'.format(type(datasets)))
        for dataset in datasets:
            self.link(dataset, ids)
        log("DB: Added {} examples to {} datasets"
            .format(len(examples), len(datasets)))

We are using Prodigy 1.5.1 (not using most recent one because it clashes with neuralcoref) and Postgres database backend.

Thanks!


(Ines Montani) #2

Thanks for the report and sorry about that!

What exactly is the clash here? Is there anything we can do on our end to make this work?

The latest version of Prodigy did fix several issues related to database links, so it’s possible that those were also responsible for the behaviour you’re seeing here. Is there any way you can run 1.6.1 and see if you’re able to reproduce the problem? Even if you only see it once, this would at least tell us that it’s still happening and not yet fixed.


(Motoki Wu) #3

> What exactly is the clash here? Is there anything we can do on our end to make this work?

It’s this error: https://github.com/huggingface/neuralcoref/issues/120#issuecomment-448316255

But I guess we can downgrade to spacy==2.0.12 and thinc==6.10.3. We’ll report back if we see any issues with Prodigy 1.6.1.


(Ines Montani) #4

Ugh, that msgpack update :disappointed: Did downgrading / pinning msgpack not work for you? If I remember correctly, it was really just a minor/patch version of msgpack that introduced this problem and got pulled in. (Hopefully, this kind of stuff should happen less and less in the future. We’re slowly starting to take more control over our dependencies – see srsly for example. We’ll also be switching Prodigy over to that in one of the upcoming versions.)