Duplicate examples when loaded in separate batches

Hi.
I'm having trouble avoiding duplicates while annotating data. I tried to reproduce the problem in the code sample below. In this sample, examples 1, 2 and 15 are duplicates. During annotation, examples 1 and 15 are displayed but example 2 is not. My guess is that examples 1 and 2 are loaded together as part of the same batch, so deduplication happens at that level, which prevents example 2 from being displayed. Example 15, however, is loaded in the next batch, and the fact that an identical example has just been labeled is not taken into account.
If I save the annotated examples to the database after annotating the first example, then stop and restart the annotation process, the duplicate examples are no longer shown. I suppose this is due to the "auto_exclude_current" behaviour, which compares hashes against those stored in the database. However, saving the annotated examples does not help unless the process is stopped and restarted before the duplicates come up, which makes me think that the hashes are retrieved from the database on startup but not updated afterwards as new annotated examples are added to it?
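To illustrate what I mean, the behaviour I would expect is roughly what the following (untested) filter sketch does: re-read the stored hashes for every incoming example instead of only once on startup. The database helpers are the documented API as far as I can tell, but the filter itself is purely hypothetical:

    from prodigy import set_hashes
    from prodigy.components.db import connect

    def exclude_already_annotated(stream, dataset):
        # Hypothetical workaround: query the database again for every
        # example, so hashes saved mid-session are also taken into account
        db = connect()
        for eg in stream:
            eg = set_hashes(eg, input_keys=("text",))
            seen = set(db.get_input_hashes(dataset))  # refreshed each time
            if eg["_input_hash"] not in seen:
                yield eg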

import prodigy
from prodigy import set_hashes


@prodigy.recipe(
    "test-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
)
def test_labeller(dataset: str):
    stream = [
        {"text": "a"},
        {"text": "a"},  # duplicate of the first example
        {"text": "b"},
        {"text": "c"},
        {"text": "d"},
        {"text": "e"},
        {"text": "f"},
        {"text": "g"},
        {"text": "h"},
        {"text": "i"},
        {"text": "j"},
        {"text": "k"},
        {"text": "l"},
        {"text": "m"},
        {"text": "a"},  # duplicate of the first example
        {"text": "n"},
        {"text": "o"},
    ]

    def add_options(stream):
        # Helper function to add options to every task in a stream
        for task in stream:
            task["options"] = [
                {"id": "1", "text": "option 1"},
                {"id": "2", "text": "option 2"},
            ]
            # Note the trailing comma: input_keys expects a tuple of key
            # names, and ("text") would just be the string "text"
            task = set_hashes(task, input_keys=("text",))
            yield task

    stream = add_options(stream)  # add options to each task
    return {
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "view_id": "blocks",  # Annotation interface to use
        "config": {
            "blocks": [
                {
                    "view_id": "html",
                    "html_template": "Feature 1: {{text}}",
                },
                {"view_id": "choice", "text": None},
            ],
        },
    }

I have tried moving the duplicate example further down the list (say, 40 examples after its first occurrence), so that the batch containing the duplicate is only loaded after the first occurrence has been saved to the database, but it is still not filtered out.
For some reason, setting "force_stream_order": true in the config seems to solve this issue, but I would like to run an annotation task with multiple named sessions, and I do not want to create duplicates between annotators. I therefore need to set "feed_overlap": false, and because my annotators may work simultaneously, this post seems to indicate that setting "force_stream_order": true is not recommended (Option feed_overlap=false doesn't show expected behaviour). See the sketch below for the combination I would need.
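For context, the combination I would need in the recipe's "config" looks like this, which (if I understand the linked post correctly) is exactly what is discouraged for simultaneous annotators:

    config = {
        # Don't send the same example to several named sessions
        "feed_overlap": False,
        # Seems to fix the cross-batch duplicates, but reportedly
        # not safe when annotators work at the same time
        "force_stream_order": True,
    }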
Am I missing something, or is there any way to enforce that no duplicate example is shown to the annotator, even if they are not loaded together as part of the same batch?

Hi! Thanks for the detailed report :+1:

Haven't run your code yet, but one quick suggestion: have you tried the filter_duplicates helper? It keeps an internal count of the hashes passing through it, and you can configure it to filter by input or task hash.
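A minimal sketch of what that could look like in your recipe, assuming I'm remembering the signature correctly (by_input and by_task select which hash to compare on):

    from prodigy.components.filters import filter_duplicates

    stream = add_options(stream)  # hash the tasks as before
    # Keep only the first task per input hash, across batch boundaries too
    stream = filter_duplicates(stream, by_input=True, by_task=False)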

Is it possible to apply filter_duplicates to ner.teach, for example? Or do I need to create a custom recipe (copying the signature from ner.teach and more or less wrapping it)?

ner.teach should handle this out of the box and not actually present you with any duplicates :thinking: Setting dedup=True on the get_stream helper will wrap the incoming raw data in filter_duplicates.
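In a custom recipe, that would look roughly like this (the import path and keyword arguments are from memory and may vary slightly between Prodigy versions; source stands for your input file or loader source):

    from prodigy.components.loaders import get_stream

    # dedup=True wraps the raw source in filter_duplicates under the hood
    stream = get_stream(source, rehash=True, dedup=True, input_key="text")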

Wait, I might have misread the original post. My issue is not duplicates but rather that my stream "restarts" if I restart the server. Instead, I'd like to continue from wherever I left off in the stream when I stopped the server.

Does setting "feed_overlap": false in your config/prodigy.json help?