Issue with multi-user session

Hi there! I have been working on setting up a text classification task with multiple annotators. I want each annotator to label the same set of data, so that each example receives 3 scores, one from each of the 3 annotators. All annotators should be able to access and label data simultaneously, through their individual sessions.

So far, I've set Prodigy up so that each user goes to "http://localhost:8080/" and adds their session ID to the end of the URL (e.g. "/?session=rosamond"). I've set "feed_overlap" to true in a config file in the project's working directory. However, I'm not getting the behavior I expected from Prodigy. I run my Prodigy command (python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm --label FORMAL,GRAND -F multi_textcat_sent_sequence.py), then go to "http://localhost:8080/?session=rosamond" and am able to label all of the data for the session 'rosamond', as expected. However, if I then change the URL to end with "?session=rosamond2" (without doing anything else), I only see the final example; after labeling that one example I receive the message "No tasks available."
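For reference, the relevant setting in that working-directory prodigy.json is just:

{"feed_overlap": true}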

Then, if I open my terminal, I see the message ⚠ Front End Log - 2023-02-01 18:31:37+00:00: Duplicate _task_hash found in Frontend batch.

If I close that Prodigy session, re-run the above Prodigy command, and go to http://localhost:8080/?session=rosamond2, then I'm able to label all of the data for the second user, as expected.

I'm not sure if this is expected behavior, but I'm a bit confused as to what I should do. I want to run a single Prodigy session and have all users able to annotate data in that same session, using their user ids added to the URL. I don't want to have to manually close/reopen Prodigy sessions for each annotator.

Other details if helpful:

  • I am running Prodigy v. 1.11.9
  • I've tried doing this locally and through ngrok, and have experienced the same behavior through both options.

hi @rosamond!

Can you try logging? I'm wondering whether your feed_overlap is actually set to true like you think.

Try adding PRODIGY_LOGGING=verbose:

PRODIGY_LOGGING=verbose python -m prodigy textcat_sent_sequence sent_dataset ...

Then look for CONFIG and FEED:

$ PRODIGY_LOGGING=verbose python -m prodigy ner.manual ner_ex1 blank:en nyt_text_dedup.jsonl --label ORG

...

20:54:00: CONFIG: Using config from global prodigy.json
/Users/ryan/.prodigy/prodigy.json

20:54:00: DB: Initializing database SQLite
20:54:00: DB: Connecting to database SQLite
20:54:00: DB: Creating dataset '2023-02-01_20-54-00'
{'created': datetime.datetime(2023, 2, 1, 20, 50, 36)}

20:54:00: FEED: Initializing from controller
{'auto_count_stream': True, 'batch_size': 10, 'dataset': 'ner_ex1', 'db': <prodigy.components.db.Database object at 0x11c3070a0>, 'exclude': ['ner_ex1'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0x11c307c40>, 'stream': <generator object at 0x11c1b8540>, 'target_total_annotated': None, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_by_session': Counter(), 'validator': <prodigy.components.validate.Validator object at 0x11c306a70>, 'view_id': 'ner_manual'}

...

Two things to notice. First, in the CONFIG, you can see that Prodigy is using the global prodigy.json. This is a good check on whether your local project's prodigy.json is being read in, or if the global one (which is checked first) is being used instead.

Second, look at the FEED and verify what feed_overlap (shown as 'overlap' in the log) is set to. By default, it's False.

I'm wondering if your global prodigy.json is overriding your local project's prodigy.json.

One way to check this is using overrides (see this post for example):

PRODIGY_LOGGING=verbose PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy ner.manual ner_ex1 blank:en nyt_text_dedup.jsonl --label ORG

This would also make sense if you forgot to save your last annotations in your rosamond session (before closing out of your browser, make sure you hit save so that anything remaining in your client-side batch gets written to the DB).

By default, Prodigy will dedupe by _task_hash, so if you're only running one annotation task, that warning likely means you have some duplicates in your data.
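To make the hashing concrete, here's a minimal sketch using prodigy.set_hashes with its default keys (this isn't part of your recipe, just an illustration):

from prodigy import set_hashes

# two tasks with identical input text and no differing task keys
eg1 = set_hashes({"text": "The same sentence."})
eg2 = set_hashes({"text": "The same sentence."})

# they receive identical hashes, so the second would be filtered out of the feed
assert eg1["_input_hash"] == eg2["_input_hash"]
assert eg1["_task_hash"] == eg2["_task_hash"]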

Check those items and let us know if you're still having issues.

Thank you for this response! Unfortunately, I'm still dealing with this issue, even when I add PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' before the command. I've turned on verbose logging, and FEED shows 'overlap': True. I've also changed my global config file to have "feed_overlap": true. I've been careful to always save my work.

I stated this above, but if I end the prodigy session in my terminal and then re-run the prodigy command and use a different session ID, the feed overlap works as expected. However, it doesn't work with my preferred use case: I want to start a prodigy annotation task once in my terminal and then let multiple annotators perform annotation (with different session IDs) on the same port, simultaneously.

Do you have any idea what the issue could be? Thank you!

hi @rosamond!

Thanks for the update.

This is good. At least we can confidently confirm that feed_overlap is set to true.

This is where I think we're (or perhaps just I am) missing something: the workflow you describe, with multiple annotators labeling simultaneously under different session IDs, is exactly what should occur with "feed_overlap": true and named multi-user sessions.

Can you try to reproduce this example (with "feed_overlap": true) using this sample dataset?

nyt_text_dedup.jsonl (18.5 KB)

PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python3 -m prodigy ner.manual ner_ex blank:en nyt_text_dedup.jsonl --label ORG

Then open multiple browsers simultaneously with different session IDs ("?session=rosamond", "?session=rosamond2", etc.). Each session should start back at the first example (record 0).

The nice thing about testing with this dataset is that it keeps the record number in the meta tag, which will appear on each record's "card" in Prodigy (look at the bottom right). The records are numbered from 0 to 175 and deduped (i.e., no two input texts are the same).

Let me know if this works.

Hi @ryanwesslen thank you for this example! I used this dataset with the example you sent, and the multi-session annotation occurred exactly as I had expected originally. There must be a bug in how I've set up the prodigy recipe for my project. Do you have any ideas about something that could cause this?

In case it's of use, here's my python script. This is lightly edited from your recommendation about annotating sentences in paragraph context:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from typing import List, Optional
from prodigy.util import split_string


# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task

def add_labels_to_stream(stream, labels):
    for task in stream:
        task["label"] = labels[0]
        yield task

@prodigy.recipe(
    "textcat_sent_sequence",
    dataset=("Dataset to save answers to", "positional", None, str),
    examples=("Examples to load from disk", "positional", None, str),
    model=("spaCy model to load", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
)

def textcat_topic(dataset, examples, model, label):
    # import spaCy
    nlp = spacy.load(model)

    # set up stream; may want get_stream() instead to hash/avoid dedup
    stream = JSONL(examples)

    #Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)

    # Render highlight of each sentence 
    def add_html(examples):
        for ex in examples:
            doc = nlp(ex["paragraph"])

            for sent in doc.sents:
                summary_highlight = ex["paragraph"]
                summary_highlight = summary_highlight.replace(
                    sent.text, f"<u style='background-color: yellow;'>{sent.text}</u>"
                )
                ex["text"] = sent.text
                ex["html"] = f"{summary_highlight}"
                ex["label"] = "LEGAL INTERPRETATION"
                yield ex

    # delete html key in output data
    def before_db(examples):
        for ex in examples:
            del ex["html"]
        return examples

    return {
        "before_db": before_db,
        "dataset": dataset,
        "stream": add_html(stream),
        "view_id": "choice" if has_options else "classification",  
    }

Here's the command I'm running:

PRODIGY_LOGGING=verbose PRODIGY_CONFIG_OVERRIDES='{"feed_overlap":true}' python -m prodigy textcat_sent_sequence sent_dataset test_prodigy_clean.json en_core_web_sm --label FORMAL,GRAND,NONE -F prodigy_textcat.py

Thank you!

hi @rosamond!

Yep, there was a problem with the script. Essentially, I needed to run set_hashes to get unique input hashes (see the deduplication/hashing docs for why/how hashes are used).
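The key change, which you'll see in context in the full updated recipe below, is a small wrapper that explicitly sets each task's hashes from its "text" and "paragraph" keys:

from prodigy import set_hashes

def set_hash(examples):
    # recompute _input_hash/_task_hash from the sentence text and its paragraph,
    # so every sentence-level task is unique in the feed
    stream = (set_hashes(eg, input_keys=("text", "paragraph")) for eg in examples)
    return stream

With each sentence hashed on its own text, the feed no longer treats the tasks as duplicates.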

For example, if you run:

PRODIGY_CONFIG_OVERRIDES='{"feed_overlap":true}' python3 -m prodigy textcat_sent_sequence sq en_core_web_sm paragraphs.jsonl --label LEGAL1 -F textcat_sent_sequence.py

Open up a browser and use session=ryan: it starts with the first example, and I can label all of the examples.

At any time, if you open a new session, which we'll call session=ryan2, it will start at the first example again.

Here's the updated recipe:

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from typing import List, Optional
from prodigy import set_hashes
from prodigy.util import split_string


# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task

def add_labels_to_stream(stream, labels):
    for task in stream:
        task["label"] = labels[0]
        yield task

@prodigy.recipe(
    "textcat-sent-sequence",
    dataset=("Dataset to save answers to", "positional", None, str),
    spacy_model=("spaCy model to load", "positional", None, str),
    source=("Examples to load from disk", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclusive=("Treat classes as mutually exclusive", "flag", "E", bool),
)

def textcat_sent_sequence(
    dataset: str,
    spacy_model: str, 
    source: str, 
    label: Optional[List[str]] = None, 
    exclusive: bool = False
):

    # Render highlight of each sentence 
    def add_html(examples):
        for ex in examples:
            doc = nlp(ex["paragraph"])

            for sent in doc.sents:
                summary_highlight = ex["paragraph"]
                summary_highlight = summary_highlight.replace(
                    sent.text, f"<u style='background-color: yellow;'>{sent.text}</u>"
                )
                ex["text"] = sent.text
                ex["html"] = f"{summary_highlight}"
                yield ex

    def set_hash(examples):
        stream = (set_hashes(eg, input_keys=("text", "paragraph")) for eg in examples)
        return stream

    # import spaCy
    nlp = spacy.load(spacy_model)

    # set up stream; may want get_stream() instead to hash/avoid dedup
    stream = JSONL(source)
    stream = add_html(stream)

    #Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)

    # delete html key in output data
    def before_db(examples):
        for ex in examples:
            del ex["html"]
        return examples

    return {
        "before_db": before_db,
        "dataset": dataset,
        "stream": set_hash(stream),
        "view_id": "choice" if has_options else "classification",
        "config": {  # Additional config settings, mostly for app UI
            "choice_style": "single" if exclusive else "multiple", # Style of choice interface
            "exclude_by": "input" if has_options else "task", # Hash value used to filter out already seen examples
        },
    }

I went back and cleaned up some other aspects of the recipe:

  • Consistent with Prodigy's default recipes, I changed the name of the input-file argument from examples to source.
  • Consistent with Prodigy's default recipes, I changed the order of the inputs, putting spacy_model (renamed from model) before the source.
  • I noticed you previously hard-coded the label -- see ex["label"] = "LEGAL INTERPRETATION". I removed this and kept the generalized handling of multiple labels.
  • I also added a new input, --exclusive. This is a flag that, when you're using multiple labels, forces annotators to choose only 1 label (i.e., mutually exclusive categories). If it's not set (the default), annotators can select multiple labels (i.e., non-mutually-exclusive categories). This too is consistent with Prodigy's textcat recipes; see the example command after this list.
  • Also, I changed the recipe function name from textcat_topic to textcat_sent_sequence to be consistent with the recipe name in the decorator, textcat-sent-sequence, which by convention uses "-".
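For instance, a call along these lines (reusing the dataset, labels, and file names from your earlier commands; adjust paths to your setup) would enforce a single choice per example:

PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy textcat-sent-sequence sent_dataset en_core_web_sm input_paragraphs.jsonl --label FORMAL,GRAND,NONE --exclusive -F textcat_sent_sequence.py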

Let me know if this doesn't solve your problem.

Thanks @ryanwesslen! Everything has been solved. Once again, I really appreciate your help! This is great.