I seem to be getting some duplicated Prodigy outputs from manual NER annotations. Please can you advise?
Performed and saved annotations = 3
Exported = 4
The duplicated rows have the same _input_hash and _task_hash and only one span annotated
I have run this 3 times with different examples and different numbers of annotations, each time clearing the dataset and examples. In 2 of the 3 runs I got duplicated output.
I use Postgres to store the data and that is showing the correct number of rows.
Hi! Are you using the same datasets and deleting them, or are you using new dataset names? And which version of Prodigy are you using?
This makes sense, because Prodigy will only ever store an example once. So even if an example with the same hashes is part of two datasets (or the same dataset), it will only have one row in the examples table. To assign it to the dataset(s), it'll then be linked via the links table.
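To illustrate why the row count in the examples table can be correct while an export contains an extra row, here is a minimal sketch using an in-memory SQLite database. The table and column names are a simplified, hypothetical stand-in for Prodigy's Dataset, Example and Link tables, not the actual schema, and the dataset name and hash values are made up. An export effectively follows the links, so a duplicated link yields a duplicated row.

```python
import sqlite3

# Hypothetical, simplified schema (assumption): the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE example (id INTEGER PRIMARY KEY, input_hash INTEGER, task_hash INTEGER);
CREATE TABLE link (id INTEGER PRIMARY KEY, example_id INTEGER, dataset_id INTEGER);
""")

conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'my_dataset')")
conn.executemany(
    "INSERT INTO example (id, input_hash, task_hash) VALUES (?, ?, ?)",
    [(1, 101, 201), (2, 102, 202), (3, 103, 203)],
)
# Four links for three examples: example 3 is linked to the dataset twice.
conn.executemany(
    "INSERT INTO link (example_id, dataset_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (3, 1)],
)

# Each example is stored exactly once.
stored = conn.execute("SELECT COUNT(*) FROM example").fetchone()[0]

# Reading the dataset via its links returns one row per link, not per example.
rows = conn.execute("""
    SELECT e.task_hash
    FROM link AS l
    JOIN example AS e ON e.id = l.example_id
    JOIN dataset AS d ON d.id = l.dataset_id
    WHERE d.name = 'my_dataset'
    ORDER BY l.id
""").fetchall()
exported_hashes = [row[0] for row in rows]

print(stored, len(exported_hashes))
```

In this sketch the examples table holds three rows, while reading through the links returns four, which matches the symptom described above.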
Hi Ines, thank you for coming back to me so promptly.
Prodigy version is 1.8.4
Datasets: same name, but as mentioned, the tables are deleted each time. The prodigy stats recipe shows no datasets or sessions, and the duplication does not affect every row.
Please let me know if you need to see any more code to replicate.
Thank you
Anna
Could you try upgrading to v1.8.5 and see if the problem still occurs? In v1.8.5, we fixed two specific dataset- and session-related problems (one of which was only introduced in v1.8.4) that could be what's affecting you here.
It looks like previous activity is being cached somewhere, even though the dataset and example tables are cleared.
How can this be completely cleared?
As mentioned, Postgres has the correct number of rows, so I can always read the data directly from there myself. How do I use your database connection recipe to make my own query to the database?
Thanks for the detailed info! It looks like there might be some problem with how Postgres cleans up the examples and links. Internally, each example is only stored once in the "Example" table. One example can be part of more than one dataset and the "Link" table stores the example → dataset mapping.
How many links do you see in the "Link" table? What might be happening here is that there are only 5 examples, which are linked to the dataset multiple times. And maybe the links somehow got out-of-sync, so they're not cleaned up correctly. It's a bit confusing, because we're not doing anything special in Prodigy – we're mostly calling into peewee (see prodigy/components/db.py for the implementation).
Here's a simple example of how to connect to the DB in Python (you can find more details in your PRODIGY_README.html). I suspect that this will also return the wrong number of examples, since this is what Prodigy calls under the hood.
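from prodigy.components.db import connect
# Connect using the database settings from your prodigy.json
db = connect()
# Load all examples linked to the dataset
examples = db.get_dataset("nb_workspace_1")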
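Once the examples are loaded, a quick way to check for duplicates is to count how often each pair of hashes occurs. The helper below is a hypothetical sketch, not part of Prodigy; it only assumes that each example is a dict carrying the `_input_hash` and `_task_hash` keys mentioned earlier in this thread. The sample records are made up so the snippet runs on its own.

```python
from collections import Counter


def find_duplicate_tasks(examples):
    """Return {(_input_hash, _task_hash): count} for hash pairs seen more than once."""
    counts = Counter(
        (eg.get("_input_hash"), eg.get("_task_hash")) for eg in examples
    )
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records standing in for the list of examples loaded from the dataset.
sample = [
    {"text": "first", "_input_hash": 101, "_task_hash": 201},
    {"text": "second", "_input_hash": 102, "_task_hash": 202},
    {"text": "third", "_input_hash": 103, "_task_hash": 203},
    {"text": "third", "_input_hash": 103, "_task_hash": 203},
]

duplicates = find_duplicate_tasks(sample)
print(duplicates)
```

To run it against real data, pass the list returned by `db.get_dataset(...)` instead of `sample`. An empty result means every hash pair occurs only once.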
There seem to be way too many dataset_id entries per example in the Link table, which I expect have just built up over time.
In order to start from scratch, should I be deleting the dataset, example and link tables? Are there any other tables I should be clearing?
Okay, so I think this might have been the problem then. Also, if you remove an example from the Example table manually and don't also remove the link from the Link table, it's possible to end up with out-of-sync examples like that.
Yes, the only tables Prodigy uses and creates are Dataset, Example and Link.
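If you'd rather repair the links than wipe everything, one option is to look for links that point at examples or datasets that no longer exist. The following is only a sketch against an in-memory SQLite database with a simplified, hypothetical schema (the table and column names are assumptions, and the rows are made up). Back up your Postgres database and check the real table and column names before running anything similar there.

```python
import sqlite3

# Hypothetical, simplified schema (assumption): the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE example (id INTEGER PRIMARY KEY, task_hash INTEGER);
CREATE TABLE link (id INTEGER PRIMARY KEY, example_id INTEGER, dataset_id INTEGER);

INSERT INTO dataset (id, name) VALUES (1, 'my_dataset');
INSERT INTO example (id, task_hash) VALUES (1, 201), (2, 202);
-- Link 3 points at a deleted example, link 4 at a deleted dataset.
INSERT INTO link (id, example_id, dataset_id)
VALUES (1, 1, 1), (2, 2, 1), (3, 99, 1), (4, 1, 42);
""")

# A link is orphaned if either side of the mapping is missing.
ORPHAN_FILTER = """
    example_id NOT IN (SELECT id FROM example)
    OR dataset_id NOT IN (SELECT id FROM dataset)
"""

# Inspect the orphaned links first, then delete them.
orphan_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT id FROM link WHERE {ORPHAN_FILTER} ORDER BY id"
    )
]
conn.execute(f"DELETE FROM link WHERE {ORPHAN_FILTER}")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(orphan_ids, remaining)
```

Listing the orphaned links before deleting them makes it easier to confirm that only out-of-sync rows are affected.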