Duplicated prodigy output in json

Hi,

I seem to be getting some duplicated prodigy outputs from manual ner annotations. Please can you advise?

Performed and saved annotations = 3
Exported = 4

The duplicated rows have the same _input_hash and _task_hash, and only one span is annotated.

I have run this 3 times with different examples and different numbers of annotations, clearing the dataset and examples each time. Out of the 3 runs, 2 have produced duplicated output.

I use Postgres to store the data and that is showing the correct number of rows.

Please can you advise why I am seeing duplicates?

Hi! Are you using the same datasets and deleting them, or are you using new dataset names? And which version of Prodigy are you using?

This makes sense, because Prodigy will only ever store an example once. So even if an example with the same hashes is part of two datasets (or the same dataset), it will only have one row in the examples table. To assign it to the dataset(s), it'll then be linked via the links table.
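To make the storage scheme concrete, here is a minimal sketch (not Prodigy's actual implementation) of the idea described above: each example is stored once, keyed by its hash, and dataset membership is recorded as separate link rows. The names `task_hash`, `examples_table`, and `links_table` are illustrative assumptions.

```python
# Sketch (not Prodigy's actual code) of hash-based example storage:
# one row per unique example, plus one link row per dataset assignment.
import hashlib
import json

def task_hash(example):
    # Simplified stand-in for Prodigy's _input_hash/_task_hash
    return hashlib.md5(
        json.dumps(example, sort_keys=True).encode("utf8")
    ).hexdigest()

examples_table = {}  # hash -> example: one row per unique example
links_table = []     # (dataset_name, hash): one row per assignment

def add_to_dataset(dataset, example):
    h = task_hash(example)
    examples_table.setdefault(h, example)  # stored only once
    links_table.append((dataset, h))       # but linked each time

add_to_dataset("ner_demo", {"text": "hello"})
add_to_dataset("ner_demo", {"text": "hello"})  # same hashes again
print(len(examples_table), len(links_table))  # 1 2
```

Adding the same example twice leaves one row in the examples table but two link rows, which is why exporting a dataset with stale links can return more rows than were annotated.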

Hi Ines, thank you for coming back to me so promptly.
Prodigy version is 1.8.4
Datasets: same name, but as mentioned the tables are deleted each time, the prodigy stats recipe shows nil sets and sessions, and the duplication does not occur for every row.
Please let me know if you need to see any more code to replicate.
Thank you
Anna

Could you try upgrading to v1.8.5 and see if the problem still occurs? In v1.8.5, we fixed two specific dataset- and session-related problems (one of which was only introduced in v1.8.4) that could be what's affecting you here.


Hi @ines,

We have installed v1.8.5 however continue to experience the same issue.

These are the steps I took:

  • deleted the dataset and example tables in my Postgres database
  • set up a new dataset - new name previously not used
  • started annotations using the same json as in all previous attempts

Please see the attached set up and output

And our config file...

It looks like previous activity is being cached somewhere, even though the dataset and example tables are cleared.
How can this be completely cleared?

As mentioned, Postgres has the correct number of rows, so I can always read directly from there myself. How do I use your database connection recipe to make my own query to the database?

Thank you

Anna

Thanks for the detailed info! It looks like there might be some problem with how Postgres cleans up the examples and links. Internally, each example is only stored once in the "Example" table. One example can be part of more than one dataset and the "Link" table stores the example → dataset mapping.

How many links do you see in the "Link" table? What might be happening here is that there are only 5 examples, which are linked to the dataset multiple times. And maybe the links somehow got out-of-sync, so they're not cleaned up correctly. It's a bit confusing, because we're not doing anything special in Prodigy – we're mostly calling into peewee (see prodigy/components/db.py for the implementation).

Here's a simple example of how to connect to the DB in Python (you can find more details in your PRODIGY_README.html). I suspect that this will also return the wrong number of examples, since this is what Prodigy calls under the hood.

from prodigy.components.db import connect

db = connect()  # connects using the settings in your prodigy.json
examples = db.get_dataset("nb_workspace_1")
print(len(examples))  # number of examples Prodigy sees in the dataset

Many thanks @ines

How many links do you see in the "Link" table?

There seem to be way too many dataset_id rows per example in the Link table, which I expect have just built up over time.
In order to start from scratch, should I be deleting the dataset, example and link tables? Are there any other tables I should be clearing?

Thank you

Anna

Okay, so I think this might have been the problem then. Also, if you remove an example from the Example table manually and don't also remove the link from the Link table, it's possible to end up with out-of-sync examples like that.
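The orphaned links described above can also be cleaned up directly. Here is a minimal sketch using in-memory SQLite as a stand-in for the Postgres tables (the names are assumptions, not Prodigy's code): it deletes any Link row whose example no longer exists.

```python
# Sketch: removing Link rows whose Example row was deleted manually
# (in-memory SQLite stand-in; table/column names are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE link (dataset_id INTEGER, example_id INTEGER)")
conn.execute("INSERT INTO example VALUES (1)")
# example 2 was deleted manually, but its link row remains
conn.executemany("INSERT INTO link VALUES (?, ?)", [(1, 1), (1, 2)])
conn.execute(
    "DELETE FROM link WHERE example_id NOT IN (SELECT id FROM example)")
remaining = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(remaining)  # 1 -> only the link with a matching example survives
```

In general it's safer to let Prodigy manage the tables itself rather than deleting Example rows by hand, precisely to avoid this kind of out-of-sync state.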

Yes, the only tables Prodigy uses and creates are Dataset, Example and Link, so those are the only ones you need to clear.


Looks like that's where I was going wrong. Thank you!
