Few records in the db for the same example

Hi all,
I noticed that if I have already some sessions saved in the db and the server is restarted, the progress bar starts again from 0%. I think it should count the nr of examples saved in the db for given dataset. However, the total of annotation shown in stats is correct. Also, it starts to ask to annotate already annotated examples.

I wonder what we should do in case we have a large dataset to annotate, let's say 1000 examples, and the server restarts, to avoid the annotators having to repeat all the work.

Regards

The progress bar can indeed be unintuitive, even somewhat buggy, in v1.11. What you're experiencing, though, is likely an issue with the progress bar itself, not an issue with your annotated examples. We're working on fixing that in v1.12, but in the meantime it might help to explain "why" the progress bar is tricky to interpret.

Why

When you pass a file to Prodigy, it isn't loaded into memory all at once. Prodigy might be working with very large files, so in an attempt to conserve machine resources the text is streamed in line by line. In case it's of interest: under the hood it's using srsly, which works by opening the file and then passing a generator. While a Python generator is nice and lightweight, it does come with a downside: you have no way of knowing how long it is. And this requires Prodigy to make all sorts of assumptions in order to construct a progress bar.
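
To make that concrete, here's a tiny standalone sketch (not Prodigy's actual code, and the filename is just an example) showing why a streamed JSONL source has no known length until it has been consumed:

import srsly

stream = srsly.read_jsonl("examples.jsonl")  # returns a generator, not a list
print(type(stream))  # <class 'generator'>
# len(stream)  # would raise TypeError: object of type 'generator' has no len()

# The only way to know how many examples there are is to exhaust the stream:
total = sum(1 for _ in stream)
print(total)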

In v1.12 this will be remedied by introducing a new object internally called a Source. This object will still pass a generator downstream, but while it's streaming over the input file it will keep track of the "position" of the file that's been read so far. This should allow us to give a much more meaningful progress bar.
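
Just to illustrate the general idea (this is a rough sketch of the concept, not the actual v1.12 implementation), a source can report progress by tracking how many bytes it has read relative to the total file size:

import os
import srsly

class TrackedSource:
    """Toy example: stream a JSONL file while tracking the read position."""

    def __init__(self, path):
        self.path = path
        self.total_bytes = os.path.getsize(path)
        self.bytes_read = 0

    def __iter__(self):
        with open(self.path, encoding="utf8") as f:
            for line in f:
                self.bytes_read += len(line.encode("utf8"))
                yield srsly.json_loads(line)

    @property
    def progress(self):
        # Fraction of the file read so far, usable as a progress estimate
        return self.bytes_read / self.total_bytes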

Just a check

Just to check though, are you seeing duplicates in your stream? If so, that's likely due to another issue that's certainly worth diving into. Are you using a custom recipe?

Hi @koaning,
The progress bar is clear now, thank you for the explanation. However, the way I implemented the progress bar is the following:

def progress(ctrl, update_return_value):
    return ctrl.session_annotated / 500

So Prodigy already knows how many examples there are, and if the server is restarted it should check in the db how many examples have already been annotated.
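
Something like this is what I have in mind (just a sketch, not tested; the dataset name and total are placeholders):

from prodigy.components.db import connect

DATASET = "my_dataset"  # placeholder dataset name
TOTAL = 500             # total number of examples to annotate

db = connect()  # uses the db settings from prodigy.json

def progress(ctrl, update_return_value):
    # Count what's already saved in the db, so a restart doesn't reset the bar
    annotated = len(db.get_dataset(DATASET) or [])
    return annotated / TOTAL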

For the problem of duplicated examples: yes, I'm using a custom recipe, but it's based on the standard classification recipe.

from typing import List, Optional
import prodigy
from prodigy.components.loaders import get_stream
from prodigy.util import split_string


# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task

def add_options(stream):
    """Helper function to add options to every task in a stream."""
    options = [
        {"id": "a", "text": "a"},
        {"id": "b", "text": "b"},
        {"id": "c", "text": "c"},
    ]
    for task in stream:
        task["options"] = options
        yield task

def progress(ctrl, update_return_value):
    return ctrl.session_annotated / 500

# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "memt.manual",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclusive=("Treat classes as mutually exclusive", "flag", "E", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def memt_manual(
    dataset: str,
    source: str,
    label: Optional[List[str]] = None,
    exclusive: bool = False,
    exclude: Optional[List[str]] = None,
):
    """
    Manually annotate categories that apply to a text. If more than one label
    is specified, categories are added as multiple choice options. If the
    --exclusive flag is set, categories become mutually exclusive, meaning that
    only one can be selected during annotation.
    """

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = get_stream(source, rehash=True, dedup=False, input_key="text")

    # Add the choice options to each task in the stream
    stream = add_options(stream)

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "progress": progress,
        "config": {  # Additional config settings, mostly for app UI
            "batch_size": 10,
            "blocks": [
                {"view_id": "html",   
                    "html_template": "SOURCE:<h5>{{ text }}</h5>PROPOSED TRANSLATION:<h5>{{ translation }}</h5></strong><p style='font-size: 15px'>Client: {{ client }}</p>",
                },
                {"view_id":"choice", "text":None}
            ],
        },
    }

Additionally, because Prodigy asks to annotate already annotated examples (ONLY in the case of a restarted server), at the end the total of annotated tasks shown in the stats is bigger than the real number of examples to annotate.

The session_annotated property, per the docs, indicates the "Number of tasks annotated in the current session (includes all named users in the instance)." So when the server restarts and a new session is made, it would start again at 0%. That is, unless you're passing a session in the URL via http://hostname/?session=.
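
If you'd rather not rely on the ?session= trick, a sketch of an alternative (assuming your Prodigy version exposes a total_annotated count on the controller, which is worth double-checking) would be:

def progress(ctrl, update_return_value):
    # total_annotated counts everything saved to the dataset, not just this session
    return ctrl.total_annotated / 500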

When I look at your code, it seems that add_label_options_to_stream is not used anywhere in your recipe, which also means that the --label param is currently ignored. Is that intentional?

Just to check, do you have a prodigy.json file lying around with an exclude_by setting? When you're seeing duplicates, that's usually an indication that something is going wrong with the hashing.
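
If you want to inspect the hashing yourself, a quick sketch (the example task is made up) is to call set_hashes on a task and look at the two hashes: with "exclude_by": "task" Prodigy compares the _task_hash, which depends on the task properties (like the options), not just the input text.

from prodigy import set_hashes

eg = {"text": "some example text", "options": [{"id": "a", "text": "a"}]}
eg = set_hashes(eg)
print(eg["_input_hash"], eg["_task_hash"])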

Also, if you have a small dataset together with your custom recipe that is causing an unexpected error ... you can share that with me so that I can reproduce it and have a look locally.

Honestly, I don't have much experience with Prodigy. The --label param is not very clear to me.
In my json I have "exclude_by": "task".
The session in the URL actually solves the problem!
Thanks!

I just performed a real-scenario exercise using http://hostname/?session=
and "exclude_by": "task". Out of 50 examples, 4 were repeated.
In the second exercise, the first example was not saved in the DB.

Is it possible for you to share some examples together with the most up-to-date version of your recipe? I can try to reproduce it locally.

Hi,

from typing import List, Optional
import prodigy
from prodigy.components.loaders import get_stream
from prodigy.util import split_string

def add_options(stream):
    """Helper function to add options to every task in a stream."""
    options = [
        {"id": "a", "text": "a"},
        {"id": "b", "text": "b"},
        {"id": "c", "text": "c"},
    ]
    for task in stream:
        task["options"] = options
        yield task

def progress(ctrl, update_return_value):
    return ctrl.session_annotated / 50

# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "memt.manual",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclusive=("Treat classes as mutually exclusive", "flag", "E", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def memt_manual(
    dataset: str,
    source: str,
    label: Optional[List[str]] = None,
    exclusive: bool = False,
    exclude: Optional[List[str]] = None,
):
    """
    Manually annotate categories that apply to a text. If more than one label
    is specified, categories are added as multiple choice options. If the
    --exclusive flag is set, categories become mutually exclusive, meaning that
    only one can be selected during annotation.
    """

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = get_stream(source, rehash=True, dedup=False, input_key="text")

    # Add the choice options to each task in the stream
    stream = add_options(stream)

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "progress": progress,
        "config": {  # Additional config settings, mostly for app UI
            "batch_size": 10,
            "blocks": [
                {"view_id": "html",   
                    "html_template": "{{id}}. SOURCE:<h5>{{ text }}</h5>PROPOSED TRANSLATION:<h5>{{ translation }}</h5></strong><p style='font-size: 15px'>Client: {{ client }}</p>",
                },
                {"view_id":"choice", "text":None}
            ],
        },
    }

prodigy.json:

{
  "theme": "basic",
  "custom_theme": {},
  "buttons": ["undo"],
  "history_size": 30,
  "port": 8880,
  "host": "xxx",
  "cors": true,
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "xxx",
      "user": "xxx",
      "password": "xxx"
    }
  },
  "validate": true,
  "auto_exclude_current": false,
  "choice_auto_accept": true,
  "feed_overlap": false,
  "force_stream_order": true,
  "instant_submit": false,
  "auto_count_stream": true,
  "total_examples_target": 0,
  "instructions": false,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": true,
  "hide_meta": true,
  "show_flag": false,
  "javascript": null,
  "swipe": false,
  "swipe_gestures": { "left": "accept", "right": "reject" },
  "split_sents_threshold": false,
  "writing_dir": "ltr",
  "show_whitespace": true,
  "exclude_by": "task",
  "global_css": ".prodigy-content { text-align: left} .prodigy-content p{ text-align: right; } .prodigy-container {width: 500px} .prodigy-content, .c01185 {width: 100%} .c01133 { max-width: 2000px; width: 1300px}"
}

Data example:

{"id":0,"client":"xxx","text":aaa.","translation":"bbb.","engine":"ccc"}

output:
[screenshot of the Prodigy UI showing the repeated records]

As you can see in the image, the records are repeated.

To perform the annotation exercise, I used the URL + session:
http://host:XXXX/?session=user1

But honestly, I think it will be difficult to reproduce the errors. I repeated the same experiment another time and everything went OK.

In addition, I see that there are records with no answer: 'accept': []

Regards

Hi @koaning,
If you have a minute to look at this, we would really appreciate it. It is starting to become urgent, as we are not able to run the annotation exercise until those issues are solved.

Best regards

hi @zparcheta!

Trying to step in because Vincent is juggling a lot.

I'm trying to catch up but is the root of your problem that you saw some duplicate records when annotating?

I'm not sure I understand what you mean by the "first example" not being saved in the DB, and whether this is a critical problem you're trying to solve.

And as you mention here, you noticed it once, but not again? Any chance that these duplicates tend to be near the end of your stream?

Also I noticed that your prodigy.json keeps the default of feed_overlap: false.

This sounds like it could be work stealing. I just wrote up a detailed response accumulating a lot of details on this and why it's actually a preventive measure to avoid an alternative problem: examples getting dropped. We have lots of plans in the works to provide alternative options (e.g., task routing and, in v2, a complete overhaul of our stream generator that would eliminate the need for work stealing).

It should be noted though, that a small number of duplicates is still expected in multi-user workflows with feed overlap set to false. This is perfectly normal behavior and should only occur towards the end of the example stream. These "end-of-queue" duplicates come from the work-stealing mechanism in the internal Prodigy feed. "Work-stealing" is a preventive mechanism to avoid records in the stream from being lost when an annotator requests a batch of examples to annotate, effectively locking those examples, and then never annotates them. This mechanism allows annotators that reach the end of a shared stream to annotate these otherwise locked examples that other annotators are holding on to. Essentially we have prioritized annotating all examples in your data stream at least once vs at most once while potentially losing a few.

If it is work stealing, probably your best tactic is to remind your annotators to save their annotations when they're done and not keep a browser open indefinitely. Another option that will reduce the chance of duplicates is reducing your batch_size to 1. However, this has the trade-off that users can't go back and modify their last example, as accepted records will be immediately saved to the database.

Does this make sense?

Hi @ryanwesslen,
Thank you for your extensive answer.

For now, the main problem is the repeated examples in the DB. How can I avoid them?
Should I use the option feed_overlap: true?

The second thing is that some of the records in the DB have no answer. That doesn't make any sense, because the task is automatically accepted when the annotator chooses an option.

Those two issues are blocking us.

Again, this is a little tough. If it is work stealing (the only thing I can think of), my first recommendation is education for annotators: tell them to make sure to save their annotations when they're done. Also, if you stagger annotators at different times (not running simultaneously), it will reduce the odds. You may not see any issue even if they are running simultaneously, but this would reduce the odds.

Yes, feed_overlap: true may reduce work stealing; but remember, this is a different allocation of annotations where each annotator receives all of the same records. feed_overlap: false is when records are sent out on a "first-come, first-serve" basis to whichever annotator is available.

Can you provide some background on why this is blocking? I know a handful of extra annotations isn't ideal -- but is there any reason why you can't do annotations on 1,000 examples, then simply drop the handful of duplicates after the fact? I may be missing something, but in general work stealing should only produce a small number of duplicates.
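
For example, a rough post-processing sketch (assuming you've exported the dataset with prodigy db-out to a JSONL file; the filenames are placeholders) that keeps only the first annotation per _task_hash:

import srsly

seen = set()
deduped = []
for eg in srsly.read_jsonl("annotations.jsonl"):
    if eg["_task_hash"] not in seen:
        seen.add(eg["_task_hash"])
        deduped.append(eg)

srsly.write_jsonl("annotations_deduped.jsonl", deduped)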

Just tried your recipe and realized it may simply be that your annotator accepted a task without making a choice. I was able to do that on the first record. You can add a validation check that prevents this by adding a validate_answer callback:

def validate_answer(eg):
    selected = eg.get("accept", [])
    assert len(selected) > 0, "Select at least 1 category"

Then add that function to your return dict:

return {
    "view_id": "blocks",  # Annotation interface to use
    "dataset": dataset,  # Name of dataset to save annotations
    "stream": stream,  # Incoming stream of examples
    "progress": progress,
    "validate_answer": validate_answer,  # ensure at least 1 category is selected
    "config": {  # Additional config settings, mostly for app UI
        "batch_size": 10,
        "blocks": [
            {
                "view_id": "html",
                "html_template": "{{id}}. SOURCE:<h5>{{ text }}</h5>PROPOSED TRANSLATION:<h5>{{ translation }}</h5></strong><p style='font-size: 15px'>Client: {{ client }}</p>",
            },
            {"view_id": "choice", "text": None},
        ],
    },
}

Hi again @ryanwesslen,
Thank you again for your help.

We have only one annotator per exercise, so work stealing is not possible.

Just tried your recipe and realized it may simply be that your annotator accepted a task without making a choice.

That is not possible, because according to the prodigy.json we don't have an "accept" button and the task is automatically accepted when a choice is made, so it definitely is a bug.

Interesting. Did the annotator annotate some examples in one tab, leave some records in the browser without saving them to the DB, open a new tab and annotate, and then at some point go back to the first tab and save?

I just tried this and it was possible to create duplicates due to this behavior even with 1 annotator.

Hm... I tried your exact same recipe and it had the "accept" button. In fact, I accepted the first record without selecting a choice and was able to replicate the problem. Are we talking about the same recipe?

Probably you don't have "buttons": ["undo"] in your prodigy.json. I have it defined globally.

Ah yes. I see.

How many times did this happen? Under what circumstances? Can you replicate this?

I'm scratching my head but this is incredibly hard without a fully reproducible example.

I did the annotation of 50 examples 3 times. It happened at least once in each of them. Maybe the annotators clicked too fast and it was not recorded in the DB... I'm just guessing.

Thanks for that context -- Yes, that's possible.

I took this dataset:

nyt_text_dedup.jsonl (18.5 KB)

Ran it with your recipe and your prodigy.json. One difference is that I ran it with a local default database (SQLite).

And I ran it by pressing A as fast as possible. Interestingly, I found 1 out of 176 did not have "accept": ["a"].

I noticed you used a PostgreSQL database. Was that local or in the cloud? I'd be curious if you could try the exact same experiment.
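
If it helps, here's a quick sketch (the dataset name is a placeholder) for counting the saved records and the empty answers on your side:

from prodigy.components.db import connect

db = connect()  # uses the db settings from prodigy.json
examples = db.get_dataset("my_dataset")  # placeholder dataset name
empty = [eg for eg in examples if not eg.get("accept")]
print(f"{len(examples)} records saved, {len(empty)} with an empty 'accept' list")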

Hi @ryanwesslen
PostgreSQL database is local.

I did your experiment very quickly, and the db registered only 176 out of 200 examples, with 6 empty answers. How many records have been registered in your db?

I did a second experiment with "validate_answer" and "choice_auto_accept": true, and from time to time the alert is shown, which actually fixes the issue of empty answers :+1: