Refresh browser fix with force_stream_order

Hi, I've tried using "force_stream_order" to fix the issue of losing data when users refresh their browser. As expected, the tasks they left off at are resent. The problem is that Prodigy somehow loops back and resends already-annotated tasks as well. So with, say, 60 examples in the source, it ends up allowing 90 annotations. This is a multi-user Prodigy instance, btw.

Hi! How is your recipe set up? Do you still have any custom stream logic in there that decides what to send out dynamically, or keeps looping over your data? And what exactly do you mean by "prodigy loops back to resend duplicate tasks as well"? Is this after the stream is exhausted?

I'm not using any custom stream logic when using force_stream_order. Here's my recipe:

@recipe(
    "custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSON file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def custom_recipe(
    dataset: str,
    view_id: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]],
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
):
    LOGGER.info("RECIPE: Starting recipe textcat.custom-recipe")

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    print(source)
    stream = JSONL(source)

    # def on_save(answers):
    #     ...

    def update(answers):
        # This function is triggered when Prodigy receives annotations
        print(f"\nReceived {len(answers)} annotations!")
        for ans in answers:
            del ans["html"]  # remove html coming from the view_id
            del ans["id"]

    def on_exit(controller):
        """
        Triggered when the server is stopped, i.e. Ctrl+C.
        A function that is invoked when you stop the Prodigy server.
        It takes the Controller as an argument, giving you access to the database.
        Splits the data into train, validation and test sets and writes it in
        flashtext format.
        """

    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": list(get_data(stream)),  # list conversion for progress bar on UI
        "update": update,
        "on_exit": on_exit,
        # "on_save": on_save,
        "config": {
            "blocks": [
                {"view_id": "html"},
                {
                    "view_id": "text_input",
                    "field_id": "answer_comment",
                    "field_label": "Enter your comments here (Optional)",
                    "field_rows": 5,
                    "field_placeholder": "e.g. Description is too short, and contains undefined abbreviations",
                },
            ]
        },
        "db": db,  # `db` is a database connection created elsewhere in the module
    }


def get_data(stream):
    # `env` is a Jinja2 Environment set up elsewhere in the module
    template = env.get_template("prodigy_render.html")
    for task in stream:
        task["meta"]["description"] = task["text"]
        result = {
            "html": template.render(
                name=task["meta"]["descriptionName"], description=task["text"]
            ),
            "meta": task["meta"],
            "id": task["meta"]["descriptionName"],
        }
        yield result

Also:

I meant that with force_stream_order enabled, Prodigy resends tasks that have already been annotated, so the number of tasks saved to the DB ends up higher than the number of examples in the source.
At one point, I noticed 3 batches (batch_size set to 10) were resent (i.e. 30 tasks), causing duplicates in the database.
To reproduce this, with a sample source of 60 examples, I annotated 10, saved them, refreshed the browser and kept annotating. It served a total of 90 tasks (instead of 60).

I set the exclude_by config to "task" and that didn't work either.

To make sure it wasn't my custom recipe, I reproduced the same issue (force_stream_order replaying already-annotated data) with a built-in recipe.

Hi @snd507,

I'm sorry that you're having trouble with duplicate tasks. I'm trying to reproduce the issue you're having, and I can't seem to get it to fail in the way you describe.

I've tried reproducing your problem with a modified version of your recipe and a small dataset of 60 items:

custom_recipe.py

from typing import List, Optional

from prodigy import recipe
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL
from prodigy.util import LOGGER, split_string


@recipe(
    "custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSON file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def custom_recipe(
    dataset: str,
    view_id: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]],
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
):
    db = connect()
    LOGGER.info("RECIPE: Starting recipe custom-recipe")

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    print(source)
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "view_id": "blocks",
        "stream": list(stream),  # list conversion for progress bar on UI
        "config": {"blocks": [{"view_id": "classification"}]},
        "db": db,
    }

data.jsonl

{"text": "0", "label": "LABEL"}
{"text": "1", "label": "LABEL"}
{"text": "2", "label": "LABEL"}
{"text": "3", "label": "LABEL"}
{"text": "4", "label": "LABEL"}
{"text": "5", "label": "LABEL"}
{"text": "6", "label": "LABEL"}
{"text": "7", "label": "LABEL"}
{"text": "8", "label": "LABEL"}
{"text": "9", "label": "LABEL"}
{"text": "10", "label": "LABEL"}
{"text": "11", "label": "LABEL"}
{"text": "12", "label": "LABEL"}
{"text": "13", "label": "LABEL"}
{"text": "14", "label": "LABEL"}
{"text": "15", "label": "LABEL"}
{"text": "16", "label": "LABEL"}
{"text": "17", "label": "LABEL"}
{"text": "18", "label": "LABEL"}
{"text": "19", "label": "LABEL"}
{"text": "20", "label": "LABEL"}
{"text": "21", "label": "LABEL"}
{"text": "22", "label": "LABEL"}
{"text": "23", "label": "LABEL"}
{"text": "24", "label": "LABEL"}
{"text": "25", "label": "LABEL"}
{"text": "26", "label": "LABEL"}
{"text": "27", "label": "LABEL"}
{"text": "28", "label": "LABEL"}
{"text": "29", "label": "LABEL"}
{"text": "30", "label": "LABEL"}
{"text": "31", "label": "LABEL"}
{"text": "32", "label": "LABEL"}
{"text": "33", "label": "LABEL"}
{"text": "34", "label": "LABEL"}
{"text": "35", "label": "LABEL"}
{"text": "36", "label": "LABEL"}
{"text": "37", "label": "LABEL"}
{"text": "38", "label": "LABEL"}
{"text": "39", "label": "LABEL"}
{"text": "40", "label": "LABEL"}
{"text": "41", "label": "LABEL"}
{"text": "42", "label": "LABEL"}
{"text": "43", "label": "LABEL"}
{"text": "44", "label": "LABEL"}
{"text": "45", "label": "LABEL"}
{"text": "46", "label": "LABEL"}
{"text": "47", "label": "LABEL"}
{"text": "48", "label": "LABEL"}
{"text": "49", "label": "LABEL"}
{"text": "50", "label": "LABEL"}
{"text": "51", "label": "LABEL"}
{"text": "52", "label": "LABEL"}
{"text": "53", "label": "LABEL"}
{"text": "54", "label": "LABEL"}
{"text": "55", "label": "LABEL"}
{"text": "56", "label": "LABEL"}
{"text": "57", "label": "LABEL"}
{"text": "58", "label": "LABEL"}
{"text": "59", "label": "LABEL"}

I ran prodigy with:

prodigy custom-recipe test_dataset en_core_web_sm ./data.jsonl -F ./custom_recipe.py

Then I annotated 10 examples and refreshed, repeating that until the stream was empty. No matter what I try it saves 60 examples to the database. Can you help me adjust the example above, so it fails like you are seeing?

Hi everyone,

I've actually had the same experience quite recently. It seems that the problem only occurs when I set up multiple user sessions, because it also worked for me when I was the only authorized user. So @justindujardin, could you maybe retry your example with multiple sessions?

Also, I think I've found a solution to this little problem. It seems to come from some kind of incompatibility between feed_overlap and force_stream_order. Here's what I tried:

I created a dataset of 50 examples and set up a workflow with 2 authorized sessions, built-in pos.correct recipe, batch size set to 10, instant submit activated.

First I launched the workflow with feed_overlap = false and force_stream_order = true. As I understood it, there should be no duplicate tasks, and users should receive tasks in their original order. Then I opened both sessions at the same time, expecting that session1 would start from example1 and session2 from example11. Yet both sessions actually started with example1 and just continued until example50, no matter how often I refreshed both pages. I ended up with 100 examples in the database.

It felt like feed_overlap had no effect when force_stream_order was activated. Then I deactivated force_stream_order and retried the same workflow, and that's when I got what I expected: session1 beginning with example1 and session2 with example11, and no duplicates throughout the workflow. It was perfect. It actually makes sense, because technically we're then not following the original order of the examples. If that's the case, could you please specify this in the documentation? :slight_smile: Thank you!
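For reference, the relevant part of a prodigy.json for my first run would look roughly like this (I'm assuming all other settings are left at their defaults); the second run was identical except with "force_stream_order": false:

{
  "batch_size": 10,
  "instant_submit": true,
  "feed_overlap": false,
  "force_stream_order": true
}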

Oh, and @snd507, for the missing data on refresh I have a not-so-elegant workaround: you can try setting the batch size to 1. Not very pretty, but it worked for me.

Thanks for the extra info @Kairine!

I've been able to reproduce duplicate items, but inconsistently so far. I'm spending time today debugging it further, and I'll post an update here when I find something conclusive.

Thanks for your patience! :bowing_man:

@Kairine Thanks for chiming in and sharing your solution. However, I set batch_size to 1 with "force_stream_order": true and "feed_overlap": false, and still ended up with 62 instead of 60 saved annotations. What does your config file look like? Do you have instant_submit set to true? I don't, because I'd like to give users the ability to undo a decision.

Glad to help :smile: I have both force_stream_order and feed_overlap set to false, and instant_submit set to true. We originally made that choice because we didn't want to lose annotations simply because we forgot to save them (which happened once, and it was a sad day). I'm not sure if it played a role in the deduplication though.

@snd507 and @Kairine, because of your reports I was able to reproduce the problem and come up with fixes for the duplicate entries you see when using force_stream_order. The fixes will be included in the next release. There were a few problems:

The Python app recently transitioned to FastAPI, which runs requests in threads, and this exposed a data race when the server was trying to receive answers and return new tasks at the same time. To fix this, I moved the thread lock we use up from the Feed class to the Controller class, where it can lock both reads and writes to the state.
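Schematically, the change looks something like this (a simplified sketch only, not the actual implementation; the class and method names here are just illustrative):

import threading

class Controller:
    """Illustrative only: one lock at the controller level guards both handing
    out new tasks and receiving answers, so the two request handlers can't
    interleave and serve the same task twice."""

    def __init__(self, stream):
        self._lock = threading.Lock()   # previously lived further down, in the feed
        self._stream = iter(stream)
        self._answers = []

    def get_questions(self, n):
        with self._lock:                # reads of the feed state are serialized...
            batch = []
            for _ in range(n):
                try:
                    batch.append(next(self._stream))
                except StopIteration:
                    break
            return batch

    def receive_answers(self, answers):
        with self._lock:                # ...against writes, using the same lock
            self._answers.extend(answers)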

The frontend had an off-by-one error while using force_stream_order and asking the server for new questions. Specifically, it didn't include the example you were currently answering in the list of examples to exclude from the next batch. This resulted in the server sending back a duplicate of the current example when asked for new questions.

The frontend didn't prevent you from asking the server for more questions while a previous request was still in flight. This meant that if you answered questions very quickly (e.g. by holding down a shortcut key), the client could call get_session_questions multiple times in a row. With force_stream_order, this meant you could get duplicates.
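Conceptually, the two frontend fixes amount to something like the following sketch. The real code is JavaScript; this is just Python pseudocode with made-up names to show the logic:

class QuestionFetcher:
    """Illustrative sketch of the two frontend fixes: exclude the task
    currently on screen from the next request, and never fire a second
    request while one is already in flight."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server   # e.g. a call to get_session_questions
        self._in_flight = False

    def get_more_questions(self, answered_hashes, current_task):
        if self._in_flight:               # fix 3: ignore the call if a request is pending
            return []
        exclude = set(answered_hashes)
        if current_task is not None:      # fix 2: also exclude the current example
            exclude.add(current_task["_task_hash"])
        self._in_flight = True
        try:
            return self._fetch(exclude=exclude)
        finally:
            self._in_flight = False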

To confirm the fixes, I used the dataset of 60 examples above and held down a shortcut key to answer the problems as quickly as possible until there were no more questions. Before the fixes were applied this resulted in a variable total number of examples always greater than 60. After applying the fixes I ran multiple tests with named and unnamed sessions, and they all resulted in a total of 60 annotations.

Thanks so much for your help tracking down these issues, we'll update this thread when the next version is released. :clap: :bowing_man:

@justindujardin Thanks for investigating this bug. Any idea when the next release will be?

@justindujardin thanks for the update. I'm guessing this addresses my issue in the other thread (Duplicate examples shown even though my custom recipe generates them once)

When do you plan to release a version with this bug fix? Or what is the latest previous version that doesn't have this problem? I'm currently blocked by this unfortunately.

The force_stream_order setting is very new and was only introduced recently in v1.9, so this would have always been the case. Once our tests pass, we can push a v1.9.10 patch release that only includes this fix.


Just released v1.9.10, which should fix the underlying problem with force_stream_order (explained in detail by @justindujardin in this post). The only case where a glitch may still be possible with the current implementation is if you hold down a hotkey and rapid fire – but that should also be a pretty unusual scenario.


Thanks, I'll go ahead and test it out. Great turnaround btw.

Thanks for the update!

@justindujardin @ines I was able to confirm that force_stream_order works fine with a single annotator. However, with multi-user sessions, feed_overlap=False doesn't seem to work: Prodigy loads the same 60 examples for each annotator even though feed_overlap is set to False.

@snd507 I confirmed that force_stream_order=True and feed_overlap=False cannot be combined when using named sessions. I've come up with a fix that lets you combine the flags (with a warning), and it should be available in the next release. The reason combining the flags will produce a warning is that Prodigy doesn't know which questions have been asked but not answered, so it can still show overlapping examples if multiple users are annotating at the same time.

To understand why this happens, consider the following pseudo-configuration:

batch_size = 2
dataset = ["one", "two", "three", "four", "five", "six"]
sessions = ["user1", "user2"]
force_stream_order = True
feed_overlap = False

Because force_stream_order repeats the same questions until they're answered, when "user1" and "user2" open their browsers to annotate at the same time, they're sent the same initial batch of questions ["one", "two"]. When "user1" then answers the first question, there is no client/server communication to let "user2" know that the question has been answered and that they shouldn't answer it. If "user2" were to refresh, though, they would see a new batch ["two", "three"], because "user1" answered the first question.

While the configuration may still produce some overlap, it does minimize it. The overlapping entries can later be resolved to single answers if needed by using the "review" recipe or a custom script to remove duplicates.
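If you go the custom-script route, a minimal sketch could look like this (the dataset names are placeholders, and you'd probably want a smarter rule than "keep the first answer" when sessions disagree):

from prodigy.components.db import connect

# Minimal dedup sketch: keep the first answer per task hash and write the
# result to a new dataset. Dataset names below are placeholders.
db = connect()
examples = db.get_dataset("my_overlapping_dataset")

seen = set()
deduped = []
for eg in examples:
    key = eg["_task_hash"]          # same underlying question => same task hash
    if key not in seen:
        seen.add(key)
        deduped.append(eg)          # here we simply keep the first answer we see

db.add_dataset("my_deduped_dataset")
db.add_examples(deduped, datasets=["my_deduped_dataset"])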


Environment:
- Prodigy 1.9.10 (updated today, 2020-06-05)
- batch_size == 10
- I get the same behavior in a Docker Linux image connected to a remote Postgres database with multiple users, and on my own Windows 10 machine using SQLite as a local user.
- I get the same behavior using textcat.manual and textcat.teach
- Example cmd: prodigy textcat.manual warranty_manual_01 ./data/sentences.jsonl
- I observed this same behavior in Prodigy 1.9.9

Here are two examples of undesirable behavior I am seeing when using "force_stream_order":true in prodigy.json:

Example 1:
Suppose I annotate 15 examples, click save, exit the browser and then open it again. force_stream_order forces me to re-annotate those same 15 examples.
The reason this is undesirable for my use case is that if my annotator labels 100 examples, closes her browser, and then comes back an hour later to annotate another 100, she will be forced to re-annotate the same 100 sentences before she can get to the next 100. Also, when I look in the SQL database, I can see each sentence annotated twice.

Example 2:
Note: This behavior only occurs if I am annotating fast enough to see the "Loading..." screen:
As soon as I annotate my 23rd example, I get the "Loading..." screen, and then the words "No tasks available" flash onto the screen for half a second. Then it forces me to start over and annotate all the examples from the beginning again. I know I haven't run out of tasks, because there are still hundreds of additional examples in the JSONL file that haven't been annotated. I also know I haven't run out of tasks because, if I annotate slowly enough that I don't see the "Loading..." screen, Prodigy doesn't restart after 23 examples.

Is there any way for me to avoid this behavior and still use "force_stream_order"? I don't get either of these behaviors when "force_stream_order" is set to false.