Passing the same sample more than once (with different meta-data) to the annotation server

Hi,

I am running a span annotation task using the built-in spans.manual recipe. The samples in my task consist of two parts - the text and a meta-data tag (one of three possible tags). The annotators should consider both text and meta-data when performing the annotation. Some samples may have identical text but different meta-data tags. The input jsonl may look like this:

{"text": "This is a sentence", "meta": {"tag": "TAG1"}}
{"text": "This is a sentence", "meta": {"tag": "TAG2"}}
{"text": "This is another sentence", "meta": {"tag": "TAG2"}}
{"text": "This is yet another sentence", "meta": {"tag": "TAG1"}}
{"text": "This is yet another sentence", "meta": {"tag": "TAG3"}}

In cases where there are samples with identical text (and different meta-data tags), the Prodigy server sends only the first sample for annotation and ignores the others. For example, with a jsonl file containing the above 5 rows, the server only sends the first, third and fourth rows for annotation.

To my understanding, this is a result of the default input hashing performed by Prodigy. In my case, only the text is used to hash the samples, resulting in identical input hashes for samples with identical text.
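For example, a quick check along these lines (assuming I understand the defaults correctly and set_hashes doesn't include the meta key out of the box) gives the same input hash for the first two rows:

from prodigy.util import set_hashes

# Two examples with identical text but different meta tags
eg1 = set_hashes({"text": "This is a sentence", "meta": {"tag": "TAG1"}})
eg2 = set_hashes({"text": "This is a sentence", "meta": {"tag": "TAG2"}})

# With the default input keys, only the text contributes to the input hash,
# so both examples end up with the same "_input_hash"
print(eg1["_input_hash"] == eg2["_input_hash"])  # True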

I noticed the "prodigy.set_hashes" function in the documentation, and I'm guessing the solution to my problem lies there. However, I am not sure how I am supposed to use it (in the context of a built-in recipe such as spans.manual).

If someone could explain it or perhaps point me to an example, it would be very much appreciated.

Hi @shuki ,

Apologies for a slightly delayed response!
Your planned approach is correct. In the case of your input data, you want to use a custom hashing function to make sure both the text and meta keys are taken into account when computing the input_hash.
You can achieve this either by adding your custom hashing function to the built-in spans.manual recipe (you can access the source code of the recipe in your Prodigy installation path - run prodigy stats to recall where that is), or by adding custom hashes to your input jsonl file via an external Python script and setting the rehash parameter of the get_stream call in the built-in recipe to False.
Such an external script could look like this:

from typing import Tuple

import srsly
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
from prodigy.util import set_hashes


def custom_input_hashes(stream: StreamType, keys: Tuple[str, ...]) -> StreamType:
    # Recompute the input hash of every example so that all of the given
    # top-level keys (not just "text") are taken into account
    for eg in stream:
        eg = set_hashes(eg, input_keys=keys, overwrite=True)
        yield eg


# Load the original input, rehash it with both "text" and "meta" as input
# keys, and write the rehashed examples back to disk
stream = get_stream("input_tags.jsonl")
stream.apply(custom_input_hashes, stream=stream, keys=("text", "meta"))
srsly.write_jsonl("input_tags_rehashed.jsonl", stream)

As you can see there, the custom hashing function calls Prodigy's set_hashes to assign the new input_hash, taking the two keys into account.
Then, it writes the modified input file to disk.
You can now use that file as input to the built-in spans.manual recipe. However, since the built-in recipes overwrite the hash values by default, you would need to change line 78 of the recipe so that the rehash parameter is set to False:

stream = get_stream(
    source, loader=loader, rehash=False, dedup=True, input_key="text"
)

Alternatively, as I mentioned above, you could add this custom_input_hashes function directly to the recipe and apply it to the stream right before the return statement:

stream.apply(custom_input_hashes, stream=stream, keys=("text", "meta"))
return ...

Hi,

Thank you very much for the detailed response!

I took your advice and modified the recipe to include the custom_input_hashes function and apply it to the stream right before the return statement. Since I didn't want to change the built-in recipe, I created a new Python file, my_recipe.py, and copied the spans.manual recipe into it. I made the suggested change, but I am still getting the same behavior, i.e., the server ignores the meta-data and sends every "unique" text only once. More specifically, I am using the 5-line jsonl file from my first post as input, and the server only sends lines 1, 3 and 4 to the annotator.

Below is my_recipe.py:

from typing import Callable, List, Optional, Tuple

from spacy.language import Language
from spacy.tokens import Doc
from spacy.util import registry as spacy_registry

from prodigy.components.preprocess import add_tokens
from prodigy.components.stream import get_stream
from prodigy.core import Arg, recipe
from prodigy.models.matcher import PatternMatcher
from prodigy.protocols import ControllerComponentsDict
from prodigy.types import ExistingFilePath, LabelsType, SourceType, StreamType, TaskType
from prodigy.util import (
    get_pipe_labels,
    log,
    msg,
    set_hashes,
)


@recipe(
    "my_recipe",
    # fmt: off
    dataset=Arg(help="Dataset to save annotations to"),
    nlp=Arg(help="Loadable spaCy pipeline for tokenization or blank:lang (e.g. blank:en)"),
    source=Arg(help="Data to annotate (file path or '-' to read from standard input)"),
    loader=Arg("--loader", "-lo", help="Loader (guessed from file extension if not set)"),
    label=Arg("--label", "-l", help="Comma-separated label(s) to annotate or text file with one label per line"),
    patterns=Arg("--patterns", "-pt", help="Path to match patterns file"),
    exclude=Arg("--exclude", "-e", help="Comma-separated list of dataset IDs whose annotations to exclude"),
    highlight_chars=Arg("--highlight-chars", "-C", help="Allow highlighting individual characters instead of tokens"),
    suggester=Arg("--suggesters", "-sg", help="Name of suggester function registered in spaCy's 'misc' registry. Will be used to validate annotations as they're submitted. Use the -F option to provide a custom Python file"),
    use_annotations=Arg("--use-annotations", "-A", help="Use annotations from the specified spaCy model."),
    # fmt: on
)
def my_recipe(
    dataset: str,
    nlp: Language,
    source: SourceType,
    loader: Optional[str] = None,
    label: Optional[LabelsType] = None,
    patterns: Optional[ExistingFilePath] = None,
    exclude: List[str] = [],
    highlight_chars: bool = False,
    suggester: Optional[str] = None,
    use_annotations: bool = False,
) -> ControllerComponentsDict:
    """
    Annotate potentially overlapping and nested spans in the data. If
    patterns are provided, their matches are highlighted in the example, if
    available. The tokenizer is used to tokenize the incoming texts so the
    selection can snap to token boundaries. You can also set --highlight-chars
    for character-based highlighting.
    """
    log("RECIPE: Starting recipe my_recipe", locals())
    labels = get_pipe_labels(label, nlp.pipe_labels.get("spancat", []))
    log(f"RECIPE: Annotating with {len(labels)} labels", labels)

    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    if patterns is not None:
        pattern_matcher = PatternMatcher(
            nlp, combine_matches=True, all_examples=True, allow_overlap=True
        )
        pattern_matcher = pattern_matcher.from_disk(patterns)
        stream.apply(lambda d: (eg for _, eg in pattern_matcher(d)))
    # Add "tokens" key to the tasks, either with words or characters
    stream.apply(add_tokens, nlp=nlp, stream=stream)
    validate_func = (
        validate_with_suggester(nlp, suggester, use_annotations=use_annotations)
        if suggester
        else None
    )

    stream.apply(custom_input_hashes, stream=stream, keys=("text", "meta"))

    return {
        "view_id": "spans_manual",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "validate_answer": validate_func,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "exclude_by": "input",
            "ner_manual_highlight_chars": highlight_chars,
            "auto_count_stream": True,
        },
    }


def validate_with_suggester(
    nlp: Language,
    suggester_name: str,
    *,
    use_annotations: bool,
) -> Callable[[TaskType], None]:
    msg.info(f"Validating annotations against suggester function '{suggester_name}'")
    suggester = spacy_registry.get("misc", suggester_name)()

    def validate_answer(answer: TaskType) -> None:
        spans = answer.get("spans", [])
        if not spans:  # No need to run suggester if we don't have spans
            return
        # Don't allow spans that are not compatible with the provided suggester
        words = [t["text"] for t in answer["tokens"]]
        spaces = [t.get("ws", True) for t in answer["tokens"]]
        doc = Doc(nlp.vocab, words=words, spaces=spaces)
        # Add annotations from other components
        if use_annotations:
            doc = nlp(doc)
        suggested_spans = suggester([doc])
        suggested_span_tuples = [(s[0], s[1]) for s in suggested_spans.data]
        text = answer["text"]
        annotated = {
            ((s["token_start"], s["token_end"] + 1)): text[s["start"] : s["end"]]
            for s in spans
        }
        for annotated_tuple, text in annotated.items():
            if annotated_tuple not in suggested_span_tuples:
                start, end = annotated_tuple
                err = (
                    f"Span with token offsets {start}:{end} ({text}) "
                    f"is not compatible with the provided suggester function "
                    f"'{suggester_name}'."
                )
                raise ValueError(err)

    return validate_answer


def custom_input_hashes(stream: StreamType, keys: Tuple[str, ...]) -> StreamType:
    for eg in stream:
        eg = set_hashes(eg, input_keys=keys, overwrite=True)
        yield eg

What am I doing wrong?

OK, I figured it out.

The call to get_stream (line 60) has the dedup input argument set to True (dedup=True), so the "duplicate" samples are removed before I get a chance to rehash them...

Removing dedup=True (the default value is False) solved the problem. Now the recipe behaves as expected.
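
For reference, the get_stream call in my copy of the recipe now looks like this (dedup is simply left at its default of False):

stream = get_stream(
    source, loader=loader, rehash=True, input_key="text"
)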

Thanks for the help! :slight_smile:

Apologies for late follow-up! Glad to hear you figured it out and thanks for sharing your solution!

Hi,

I apologize for waking up an old thread, but I have an issue which is related to the one described here.

Just a quick recap: I am annotating samples which consist of text and meta data. The text may repeat itself (with different meta data) in different samples. For example, say that the meta data consists of an "ID"; the dataset may look like:

text: "text1", ID: 1
text: "text2", ID: 2
text: "text2", ID: 3
text: "text3", ID: 4

containing N=4 distinct samples with M=3 unique texts. Since the default spans.manual recipe hashes samples based on the text alone, I modified it to create a custom recipe which takes the meta data into account when hashing (see the previous posts).

Now I need to use the review recipe in order to review the annotated samples. At first, I encountered the same problem (hashing based on the text alone). I found a fairly simple solution this time: the review recipe actually defines the keys used for hashing the input (on line 32), so I created an identical recipe in which I added "meta" as an input key, and that seems to solve the problem.
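
In case it helps anyone else, here is roughly what the change looks like (the exact name and default contents of the keys tuple depend on the Prodigy version, so treat this as a sketch rather than the literal line from the recipe):

# In a copy of the built-in review recipe, add "meta" to the tuple of keys
# used to compute the input hash. Your version of the recipe may list more
# keys here (e.g. for image or audio tasks) - the point is just to append "meta"
INPUT_KEYS = ("text", "meta")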

However, I have encountered a different (strange) problem. When I use the review interface to review the annotations, it only allows M (the no. of unique texts) samples to be listed in the panel on the left, where the samples I had reviewed are listed (rather than N, the actual no. of samples). When I click the "V" button for sample no. M+1, it is moved to the top of the list, but the last sample on the list disappears (with no option to return to it). It looks as if it only allows M samples (as opposed to the N samples contained in the dataset) to be shown on the list.

For example, if I review the example dataset presented above (with N=4, M=3), the first three samples (ID=1,2,3) are processed normally and are shown on the left with 'V' marks next to them. When I approve the fourth sample (ID=4), it is presented at the top of the list as it should be, but the first sample (ID=1) disappears from the bottom of the list. The interface only allows three samples to be shown on that list.

I imagine this is an issue with the GUI? I couldn't find anything in the code of the recipes that might be connected to this.

I would appreciate any help on this... Thanks in advance!

Hi @shuki,

Please correct me if I'm wrong, but when you refer to the "panel on the left," I assume you're referring to the "HISTORY" listing in the sidebar of the annotation UI.

Just to clarify: the "V" sign in the history panel isn't actually a button on its own—it's a visual indicator reflecting the decision made for that task (in this case, "accept"). For rejected examples, you'll see an "X" displayed there instead.

The size of this history list equals the batch size configured for your annotation session (10 by default). The history view serves two purposes: it gives you an overview of your most recent decisions and allows you to redo them if needed.

History navigation: Clicking on any task in the history panel backtracks the stream to that specific point, which means you'll need to redo all tasks from the selected point onwards. This isn't designed for general navigation between tasks—it's specifically for backtracking the stream. When you click a particular task, you're essentially resetting your stream to that point, which is why the selected task moves to the top and any tasks that came after it disappear (since they're being undone).

This behavior exists because Prodigy uses a stream-based input system, so there's no straightforward way to jump back and forth between arbitrary tasks and then return to your original position.

History size: The history view size is linked to the batch size by default due to Prodigy's buffering mechanism. The system can only keep a limited number of items in the browser's buffer before sending them to the backend and database. Once tasks are sent to the server and stored in the DB, they can no longer be edited from the current UI.

If you need a larger history view, you can increase it by adjusting the batch size (for example, via the batch_size parameter in prodigy.json).
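
For example, a minimal prodigy.json overriding the default could look like this (20 is just an illustrative value):

{
  "batch_size": 20
}

Keep in mind that a larger batch size also means more unsaved answers are held in the browser before they are sent to the server and saved to the database.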