Inconsistency in the Number of Annotated Data

Hi @sigitpurnomo,

Yes, the input_hash is computed from the string value of the input text, so any change to the string will result in a different input_hash value. And if the same target dataset was used before stopping the server and the original example was saved in the DB, the dataset will end up containing examples that are not in the current input file.
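
Here's a minimal sketch of that behaviour, using the top-level set_hashes helper (the example texts are just placeholders):

```python
from prodigy import set_hashes

# Two versions of the "same" question - the second one was edited slightly.
original = set_hashes({"text": "What is the capital of France?"})
edited = set_hashes({"text": "What is the capital city of France?"})

# Any change to the text produces a different _input_hash, so the edited
# example no longer matches the one already stored in the database.
assert original["_input_hash"] != edited["_input_hash"]
```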

We can definitely filter these examples out of the final dataset via a Python script, but for the future, perhaps a more efficient workflow would be to instruct the annotators to ignore (by hitting the ignore button) examples that need editing, and then postprocess them in a single pass, i.e.:

  1. once the annotation is done, filter out the ignored examples
  2. edit them according to your needs and save them in a separate JSONL file (see the sketch after this list for steps 1 and 2)
  3. set up another annotation session with just these edited examples
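
For steps 1 and 2, a minimal sketch could look like this (the dataset name and output file name are placeholders - adjust them to your setup):

```python
import srsly
from prodigy.components.stream import get_stream

# Load the annotated examples from the target dataset and keep only the
# ones the annotators marked as "ignore".
annotated = get_stream("dataset:main_dataset")
ignored = [eg for eg in annotated if eg.get("answer") == "ignore"]

# Save them for manual editing; the edited file can then be used as the
# source of a new annotation session.
srsly.write_jsonl("ignored_for_editing.jsonl", ignored)
print(f"Exported {len(ignored)} ignored examples for editing.")
```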

On to the remaining questions:

  1. Is it possible to make the duplicate text appear for the annotation?

Are you asking whether it's possible to make an annotator annotate exactly the same question more than once?
For that you'd need to set dedup to False in the get_stream function - that prevents duplicates from being removed from the input stream.
Currently, your recipe is using the legacy JSONL loader, and the recommended way to load the source is via the get_stream function.
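
Inside a custom recipe, the loading step could look roughly like this (a sketch, assuming a JSONL source; the file name is a placeholder):

```python
from prodigy.components.stream import get_stream

# Load the source without removing duplicate examples.
stream = get_stream(
    "questions.jsonl",  # placeholder source file
    rehash=True,        # make sure every example gets _input_hash / _task_hash
    dedup=False,        # keep duplicates in the stream
    input_key="text",   # the field the input hash is based on
)
```
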
Then, another place where filtering happens is within each session. Whenever a session asks for examples, the candidate examples from the stream are checked against the examples already annotated by that session, based on the input or task hash (depending on the exclude_by config setting).
So in your case, if the examples are completely identical, i.e. they have the same input hash and the same task hash, you would need to differentiate between them using a custom hashing function that takes into account some attribute that distinguishes the examples, e.g. a custom field in the input file such as "copy": 1. This new field should then be used together with text to compute the input hash.
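
A minimal sketch of such a custom hashing step (the "copy" field is the hypothetical extra attribute mentioned above, and the function name is just for illustration):

```python
from prodigy import set_hashes

def add_custom_hashes(stream):
    # Recompute the hashes so that the extra "copy" field contributes to the
    # input hash: identical texts with different "copy" values then get
    # different hashes and won't be treated as duplicates.
    for eg in stream:
        yield set_hashes(eg, input_keys=("text", "copy"), overwrite=True)

# In the recipe, after loading the stream (same pattern as in the script below):
# stream.apply(add_custom_hashes, stream=stream)
```
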
If, however, by duplicates you mean the same input text but a different question about it, e.g. different options in the choice block, then you can set "exclude_by": "task" in your .prodigy.json - in fact, this is the default setting, so you shouldn't have to change anything unless it's currently set to "input".
I wonder what the purpose of sending duplicates is - is it to compute intra-annotator agreement as opposed to (or on top of) inter-annotator agreement?

  2. How do we remove the extra annotations from the dataset (database)?

You'd need to filter your current target datasets (both the main and the session datasets) and remove the examples that are not present in the current input file. Here's an example script that could do that. It saves copies of the edited datasets as new JSONL files in the current working directory:

import sys
from typing import List, Set

import srsly

from prodigy.components.db import connect
from prodigy.components.stream import get_stream
from prodigy.types import StreamType


def filter_out_examples(stream: StreamType, hashes: Set[int]) -> StreamType:
    """Filter out examples with specific hashes from the stream."""
    for example in stream:
        if example["_input_hash"] not in hashes:
            yield example


def get_stream_hashes(db, dataset_name: str) -> List[int]:
    """Get hashes from a dataset."""
    return db.get_input_hashes(dataset_name)


def save_dataset(stream: StreamType, dataset_name: str) -> None:
    """Save a dataset stream to a JSONL file."""
    output_filename = f"{dataset_name}_edited.jsonl"
    srsly.write_jsonl(output_filename, stream)
    print(f"Saved an updated copy of {dataset_name} as {output_filename}")


def get_extra_hashes(
    input_hashes: List[int], session_hashes: List[List[int]]
) -> Set[int]:
    """Get hashes that are in session data but not in input."""
    extra_hashes = set()
    for session_hash_list in session_hashes:
        extra_hashes.update(set(session_hash_list) - set(input_hashes))
    return extra_hashes


def clean_datasets(
    input_filename: str,
    main_dataset: str,
    annotator1_dataset: str,
    annotator2_dataset: str,
) -> None:
    """Clean the datasets by removing extra examples."""
    print("Processing datasets:")
    print(f"Input file: {input_filename}")
    print(f"Main dataset: {main_dataset}")
    print(f"Annotator 1 dataset: {annotator1_dataset}")
    print(f"Annotator 2 dataset: {annotator2_dataset}")

    # Connect to database
    db = connect()

    # Get input stream and hashes
    input_stream = get_stream(input_filename, dedup=False)
    input_hashes = [ex["_input_hash"] for ex in input_stream]

    # Get dataset streams
    streams = {
        "main": get_stream(f"dataset:{main_dataset}"),
        "annotator1": get_stream(f"dataset:{annotator1_dataset}"),
        "annotator2": get_stream(f"dataset:{annotator2_dataset}"),
    }

    # Get session hashes
    session_hashes = [
        get_stream_hashes(db, annotator1_dataset),
        get_stream_hashes(db, annotator2_dataset),
    ]

    # Find extra examples
    extra_hashes = get_extra_hashes(input_hashes, session_hashes)

    if extra_hashes:
        print(f"Found {len(extra_hashes)} examples to filter out.")
    else:
        print("No examples to filter out")
        sys.exit(0)

    # Filter and save datasets
    dataset_names = {
        "main": main_dataset,
        "annotator1": annotator1_dataset,
        "annotator2": annotator2_dataset,
    }

    for key, stream in streams.items():
        # Wrap each stream with the filter function in place, then save the result
        stream.apply(filter_out_examples, stream=stream, hashes=extra_hashes)
        save_dataset(stream, dataset_names[key])


def main():
    clean_datasets(
        input_filename="input.jsonl",  # replace with your input filename
        main_dataset="main_dataset",  # replace with target dataset
        annotator1_dataset="main_dataset-annotator1",  # replace with annotator 1 session dataset
        annotator2_dataset="main_dataset-annotator2",  # replace with annotator 2 session dataset
    )


if __name__ == "__main__":
    main()
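
To run it, replace the placeholders in main() with your actual input file and dataset names and execute the script directly, e.g. python clean_datasets.py (the file name is just an example). If you then want the cleaned versions back in the database, you can load each *_edited.jsonl file into a fresh dataset with the db-in command.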