Duplicated prodigy output in json

Hi,

I seem to be getting some duplicated prodigy outputs from manual ner annotations. Please can you advise?

Performed and saved annotations = 3
Exported = 4

The duplicated rows have the same _input_hash and _task_hash, and only one span is annotated.

I have run this 3 times with different examples and different numbers of annotations, clearing the dataset and examples each time. Out of the 3 runs, 2 have produced duplicated output.

I use Postgres to store the data and that is showing the correct number of rows.

Please can you advise why I am seeing duplicates?

Hi! Are you using the same datasets and deleting them, or are you using new dataset names? And which version of Prodigy are you using?

This makes sense, because Prodigy will only ever store an example once. So even if an example with the same hashes is part of two datasets (or the same dataset), it will only have one row in the examples table. To assign it to the dataset(s), it'll then be linked via the links table.
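To make the storage scheme concrete, here is a minimal sketch (not Prodigy's actual implementation) of the idea described above: each example is stored once, keyed by its hash, and dataset membership is recorded as separate link rows. The names `task_hash`, `examples_table`, and `links_table` are illustrative assumptions.

```python
# Sketch (not Prodigy's actual code) of hash-based example storage:
# one row per unique example, plus one link row per dataset assignment.
import hashlib
import json

def task_hash(example):
    # Simplified stand-in for Prodigy's _input_hash/_task_hash
    return hashlib.md5(
        json.dumps(example, sort_keys=True).encode("utf8")
    ).hexdigest()

examples_table = {}  # hash -> example: one row per unique example
links_table = []     # (dataset_name, hash): one row per assignment

def add_to_dataset(dataset, example):
    h = task_hash(example)
    examples_table.setdefault(h, example)  # stored only once
    links_table.append((dataset, h))       # but linked each time

add_to_dataset("ner_demo", {"text": "hello"})
add_to_dataset("ner_demo", {"text": "hello"})  # same hashes again
print(len(examples_table), len(links_table))  # 1 2
```

Adding the same example twice leaves one row in the examples table but two link rows, which is why exporting a dataset with stale links can return more rows than were annotated.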

Hi Ines, thank you for coming back to me so promptly.
Prodigy version is 1.8.4
Datasets: same name, but as mentioned the tables are deleted each time, the prodigy stats recipe shows nil sets and sessions, and the duplication does not occur for every row.
Please let me know if you need to see any more code to replicate.
Thank you
Anna

Could you try upgrading to v1.8.5 and see if the problem still occurs? In v1.8.5, we fixed two specific dataset- and session-related problems (one of which was only introduced in v1.8.4) that could be what's affecting you here.


Hi @ines,

We have installed v1.8.5 however continue to experience the same issue.

These are the steps I took:

  • deleted the dataset and example tables in my Postgres database
  • set up a new dataset - new name previously not used
  • started annotations using the same json as in all previous attempts

Please see the attached set up and output

And our config file...

It looks like previous activity is being cached somewhere, even though the dataset and example tables are cleared.
How can this be completely cleared?

As mentioned, Postgres has the correct number of rows, so I can always read directly from there myself. How do I use your database connection recipe to make my own query to the database?

Thank you

Anna

Thanks for the detailed info! It looks like there might be some problem with how Postgres cleans up the examples and links. Internally, each example is only stored once in the "Example" table. One example can be part of more than one dataset and the "Link" table stores the example → dataset mapping.

How many links do you see in the "Link" table? What might be happening here is that there are only 5 examples, which are linked to the dataset multiple times. And maybe the links somehow got out-of-sync, so they're not cleaned up correctly. It's a bit confusing, because we're not doing anything special in Prodigy – we're mostly calling into peewee (see prodigy/components/db.py for the implementation).

Here's a simple example of how to connect to the DB in Python (you can find more details in your PRODIGY_README.html). I suspect that this will also return the wrong number of examples, since this is what Prodigy calls under the hood.

from prodigy.components.db import connect

db = connect()  # connects using the settings in your prodigy.json
examples = db.get_dataset("nb_workspace_1")
print(len(examples))  # number of examples Prodigy sees in the dataset

Many thanks @ines

How many links do you see in the "Link" table?

There seem to be way too many dataset_id rows per example in the Link table, which I expect have just built up over time.
In order to start from scratch, should I be deleting the dataset, example and link tables? Are there any other tables I should be clearing?

Thank you

Anna

Okay, so I think this might have been the problem then. Also, if you remove an example from the Example table manually and don't also remove the link from the Link table, it's possible to end up with out-of-sync examples like that.
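The orphaned links described above can also be cleaned up directly. Here is a minimal sketch using in-memory SQLite as a stand-in for the Postgres tables (the names are assumptions, not Prodigy's code): it deletes any Link row whose example no longer exists.

```python
# Sketch: removing Link rows whose Example row was deleted manually
# (in-memory SQLite stand-in; table/column names are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE link (dataset_id INTEGER, example_id INTEGER)")
conn.execute("INSERT INTO example VALUES (1)")
# example 2 was deleted manually, but its link row remains
conn.executemany("INSERT INTO link VALUES (?, ?)", [(1, 1), (1, 2)])
conn.execute(
    "DELETE FROM link WHERE example_id NOT IN (SELECT id FROM example)")
remaining = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(remaining)  # 1 -> only the link with a matching example survives
```

In general it's safer to let Prodigy manage the tables itself rather than deleting Example rows by hand, precisely to avoid this kind of out-of-sync state.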

Yes, the only tables Prodigy uses and creates are Dataset, Example and Link, so those are the only ones you need to clear.


Looks like that's where I was going wrong. Thank you!
