hi @JulieSarah!
Thanks for your posts.
If it were me, I would start by changing `exclude_by` in your `prodigy.json` to `input` (its default is `task`). This will exclude any examples that have already been annotated with the same input. Since you want only one annotation per text (input), this may do the trick.
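To make the difference between the two settings concrete, here is a minimal, self-contained sketch. It does not use Prodigy at all: `toy_hash` and `filter_seen` are hypothetical helpers, and the hashing is only a stand-in for how input and task hashes behave conceptually (same text gives the same input hash, while the task hash also depends on what is being asked about that text).

```python
import hashlib
import json


def toy_hash(payload):
    """Stable hash of a JSON-serialisable payload (illustrative only, not Prodigy's algorithm)."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


def filter_seen(examples, exclude_by):
    """Keep only the first example for each hash selected by exclude_by."""
    seen = set()
    kept = []
    for eg in examples:
        # The input hash depends only on the text being annotated.
        input_hash = toy_hash({"text": eg["text"]})
        # The task hash also covers the question asked about that text.
        task_hash = toy_hash({"input": input_hash, "label": eg.get("label")})
        key = input_hash if exclude_by == "input" else task_hash
        if key not in seen:
            seen.add(key)
            kept.append(eg)
    return kept


examples = [
    {"text": "Prodigy is great", "label": "POSITIVE"},
    {"text": "Prodigy is great", "label": "NEGATIVE"},
    {"text": "Another text", "label": "POSITIVE"},
]

by_task = filter_seen(examples, exclude_by="task")
by_input = filter_seen(examples, exclude_by="input")
print(len(by_task), len(by_input))  # prints "3 2"
```

With `task`, the same text can come through again whenever the question about it differs. With `input`, the text is only kept once.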
The first one makes sense, but you may find it's still deduping simply by `task`. I haven't used `rehash` much, but I think it may not help, as it would rehash every incoming example and so treat each one as new.
`exclude` works great if you have known examples you want to exclude. The problem is that if new examples are being annotated at the same time, your `exclude` list may not be up to date in real time.
Relatedly, how often are your annotators answering at the same time? Even if they do answer very quickly or simultaneously, I don't think that is the sole reason you're having issues.
See this post:
We've been working this year on an experimental version that overhauls the database. The problem is that there are many possible causes of duplicates. It's a much harder problem than many realize until they get into it. One issue was corrected in 1.11.7, but several others still exist. It's interesting that you report the issue with the upgraded version. I'll make a note.
We're preparing to release v1.12 soon (in a few weeks), which will incorporate this new database. This is in addition to the v2 release planned for early next year. We decided to push out v1.12 so teams like yours can test the new database before we ship the larger overhaul in v2. One part of this change moves the ORM from `peewee` to `SQLAlchemy`, so any existing annotations will need to be migrated, which is why we have a migration script.
For now, I would suggest trying the changes you mentioned above, like setting `exclude_by` to `input`, but knowing that you may still get duplicates. If you still do, then either try out the experimental version or wait a few weeks for the new v1.12 to come out, and test that.
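If it helps, here is a small sketch of applying that setting without hand-editing the file. `set_exclude_by` is a hypothetical helper, not part of Prodigy, and the path you pass is an assumption: point it at whichever `prodigy.json` your setup actually reads. The only thing it relies on is that the config is plain JSON with an `exclude_by` key.

```python
import json
from pathlib import Path


def set_exclude_by(config_path, value="input"):
    """Set "exclude_by" in a JSON config file, keeping every other setting as it was."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["exclude_by"] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_exclude_by("prodigy.json")` run from your project directory would update a local config file. Restart the Prodigy server afterwards so the new setting is picked up.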
Sorry there isn't a simple solution, but hopefully you can see that there are many upcoming changes on the way to address these issues for the long term.