Examples re-served despite identical hashes

If I run a recipe, annotate and save a single example, tear down the server, and then re-run the same recipe, I see the same example I annotated on the first run. If I annotate and save it again and then export the annotations with db-out, I see two entries for the same example, each with identical task and input hashes. Based on the Prodigy documentation on hashing, my understanding was that an example whose task hash already exists in the database would be excluded by default. Is more configuration required on my part to get this behavior, or is this simply a bug?
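For reference, this is my understanding of the hashing from the docs, as a minimal sketch (the example text is made up):

import prodigy

task = {"text": "Some message text"}
task = prodigy.set_hashes(task)
# _input_hash is derived from the input (e.g. the text);
# _task_hash also factors in annotation-relevant keys like options
print(task["_input_hash"], task["_task_hash"])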

That sounds like something worth investigating. Could you share the example so that we can try to reproduce the behavior locally? If possible, a step-by-step guide to reproducing the issue would be appreciated. Are you running a custom recipe? If so, it'd be grand if you could share that too.

I'm using:
Python 3.10.4
Prodigy 1.11.7

Here is a simple example:

Step-by-step (screenshots omitted here):

1. I spin up a task for a new dataset called 'messages_07_24_22'.
2. The first example is presented.
3. I annotate this first example and hit save; the second example is displayed.
4. I kill the server (see the shot of my console for this first pass).
5. I run db-out on the dataset and can see my example in the export.
6. I spin up the same task on the same dataset and see the same example presented again. Note that the dataset was loaded, as indicated by the total count in the progress sidebar.
7. I annotate this example again and hit save.
8. I kill the server again (see the shot of my console for the second pass).
9. I run db-out again and can see duplicate hashes in the output.
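A quick way to confirm the duplicates in the export programmatically (a minimal sketch; the export filename is hypothetical):

import json
from collections import Counter

# count how often each task hash appears in the db-out export
with open("messages_07_24_22_export.jsonl", encoding="utf8") as f:
    hashes = Counter(json.loads(line)["_task_hash"] for line in f)

# any count > 1 means the same task was saved more than once
print({h: n for h, n in hashes.items() if n > 1})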

Here is the prodigy call I am making:

prodigy message-intents-entities-custom messages_07_24_22 en_core_web_sm data/inputs/messages_07_24_22.jsonl -F ./recipes/recipes.py -e

Here is the custom recipe used by that call:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

# (The block constants, ALL_CATEGORIES, ENTITIES and the add_options()
# helper are defined elsewhere in recipes.py and not shown here.)

@prodigy.recipe(
    "message-intents-entities-custom",
    dataset=("The dataset to save to", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    file_path=("Path to texts", "positional", None, str),
    include_entities=("Whether to include entities (optional)", "flag", "e", bool),
)
def message_intents_entities_custom(dataset, spacy_model, file_path, include_entities):
    """Custom recipe for classifying and annotating lead messages."""

    if include_entities:
        blocks = [
            NER_MANUAL_BLOCK,
            CUSTOM_ENTITIES_BLOCK,
            CHOICE_INTENTS_NO_TEXT_BLOCK,
            CUSTOM_INTENTS_BLOCK,
        ]
    else:
        blocks = [
            CHOICE_INTENTS_BLOCK,
            CUSTOM_INTENTS_BLOCK,
        ]

    # load spacy model
    nlp = spacy.load(spacy_model)

    # use JSONL loader
    stream = JSONL(file_path)

    # add tokens determined by model
    stream = add_tokens(nlp, stream)
    stream = add_options(stream, ALL_CATEGORIES)

    interface_settings = {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "blocks": blocks,
            "choice_style": "multiple",
            "validate": False,
            "show_stats": True,
            "custom_theme": {
                "cardMaxWidth": 1000,
            },
            "card_css": {
                "fontSize": 15,
            },
        },
    }

    if include_entities:
        # add entity labels for the ner_manual block
        interface_settings["config"]["labels"] = ENTITIES

    return interface_settings

Thanks!

Hello @graham, sorry for the late reply.
I was able to reproduce this behavior with a slightly different recipe, since I'm not sure what your two methods add_tokens() and add_options() are doing. My solution was to add the following to the recipe:

stream = (set_hashes(s) for s in stream)
stream = filter_duplicates(stream, by_input=True, by_task=True)

along with the imports:

from prodigy.components.filters import filter_duplicates
from prodigy import set_hashes
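
In your recipe, those two lines would slot in right after the stream is built and before the settings are returned, e.g.:

    # ... inside message_intents_entities_custom
    stream = add_tokens(nlp, stream)
    stream = add_options(stream, ALL_CATEGORIES)

    # set the hashes up front, then drop anything already seen
    stream = (set_hashes(s) for s in stream)
    stream = filter_duplicates(stream, by_input=True, by_task=True)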

I hope this solution works for you too. Let me know if you have any further questions.