Hi.
I'm having trouble avoiding duplicates while annotating data. I tried to reproduce the problem in the code sample below, in which examples 1, 2 and 15 are duplicates. During annotation, examples 1 and 15 are displayed, but example 2 is not. My guess is that examples 1 and 2 are loaded as part of the same batch, so deduplication happens at that level and prevents example 2 from being displayed. Example 15, however, is loaded in the next batch, and the fact that an identical example was labeled just before is not taken into account.
If I save the annotated examples to the database after annotating the first example, then stop the annotation process and restart it, the duplicate examples are no longer shown. I suppose this is due to the "auto_exclude_current" behaviour, which compares hashes with those stored in the database. However, saving the annotated examples does not help if the process is not stopped and restarted before the duplicates come up. This makes me think that the hashes are retrieved from the database on startup but not updated afterwards, when newly annotated examples are added to it. Is that correct?
import prodigy
from prodigy import set_hashes


@prodigy.recipe(
    "test-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
)
def test_labeller(dataset: str):
    stream = [
        {"text": "a"},
        {"text": "a"},  # example is duplicate of first one
        {"text": "b"},
        {"text": "c"},
        {"text": "d"},
        {"text": "e"},
        {"text": "f"},
        {"text": "g"},
        {"text": "h"},
        {"text": "i"},
        {"text": "j"},
        {"text": "k"},
        {"text": "l"},
        {"text": "m"},
        {"text": "a"},  # example is duplicate of first one
        {"text": "n"},
        {"text": "o"},
    ]

    def add_options(stream):
        # Helper function to add options to every task in a stream
        for task in stream:
            task["options"] = [
                {"id": "1", "text": "option 1"},
                {"id": "2", "text": "option_2"},
            ]
            # Note the trailing comma: ("text") is just the string "text",
            # while ("text",) is a one-element tuple of keys.
            task = set_hashes(task, input_keys=("text",))
            yield task

    stream = add_options(stream)  # add options to each task

    return {
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "view_id": "blocks",  # Annotation interface to use
        "config": {
            "blocks": [
                {
                    "view_id": "html",
                    "html_template": "Feature 1: {{text}}",
                },
                {"view_id": "choice", "text": None},
            ],
        },
    }
I have also tried moving the duplicate example further down the list (say, 40 examples after its first occurrence), so that the batch containing the duplicate is only loaded after the first occurrence has already been saved to the database, but it is still not filtered out.
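To make my guess concrete, here is a toy model of what I think is happening. It is plain Python, not Prodigy's actual internals: the `batches` and `dedupe_within_batch` helpers are hypothetical names of mine, and the batch size of 10 is an assumption (I believe it is the default). If duplicates are only removed inside each batch, the second "a" is dropped but the third one, which lands in the next batch, gets through.

```python
from itertools import islice


def batches(stream, size):
    """Split a stream into consecutive lists of at most `size` items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def dedupe_within_batch(batch):
    """Drop repeats inside one batch only (no memory across batches)."""
    seen = set()
    kept = []
    for text in batch:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


# Same order as the stream in my recipe: "a" at positions 1, 2 and 15
texts = ["a", "a"] + list("bcdefghijklm") + ["a", "n", "o"]

# Assumed batch size of 10: positions 1 and 2 share a batch, position 15 does not
shown = [t for batch in batches(texts, 10) for t in dedupe_within_batch(batch)]
print(shown.count("a"))
```

In this model "a" is still shown twice, which matches what I see in the app.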
For some reason, setting "force_stream_order = True" in the config seems to solve this issue. However, I would like to run an annotation task with multiple named sessions, and I do not want duplicates between annotators. I therefore need to set "feed_overlap = False", and because my annotators may work simultaneously, this post seems to indicate that setting "force_stream_order = True" is not recommended (Option feed_overlap=false doesn't show expected behaviour).
Am I missing something, or is there a way to ensure that no duplicate example is shown to an annotator, even when the duplicates are not loaded as part of the same batch?
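The workaround I am considering is to filter the stream myself before it reaches the batching step, with a set of seen hashes that lives as long as the generator does. This is only a minimal sketch: `drop_seen_inputs` and `input_hash` are hypothetical helpers of mine, and `input_hash` is just a stand-in for whatever `set_hashes` stores under `_input_hash`. I have not checked how this interacts with multiple named sessions.

```python
import hashlib


def input_hash(task):
    # Hypothetical stand-in for the value set_hashes stores in "_input_hash"
    return hashlib.md5(task["text"].encode("utf8")).hexdigest()


def drop_seen_inputs(stream):
    """Yield each task once per input hash, remembering hashes across batches."""
    seen = set()
    for task in stream:
        # Prefer the hash already on the task, fall back to the stand-in
        key = task.get("_input_hash", input_hash(task))
        if key in seen:
            continue  # duplicate input, skip it regardless of batch
        seen.add(key)
        yield task


# Same order as the stream in my recipe: "a" at positions 1, 2 and 15
texts = ["a", "a"] + list("bcdefghijklm") + ["a", "n", "o"]
tasks = [{"text": t} for t in texts]
unique = list(drop_seen_inputs(tasks))
print([t["text"] for t in unique].count("a"))
```

In the recipe this would be used as `stream = drop_seen_inputs(add_options(stream))`. If I read the docs correctly, Prodigy also ships a `filter_duplicates` helper in `prodigy.components.filters` that does something similar, but I have not tested whether it behaves differently here.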