Hi @sigitpurnomo,
Yes, the `input_hash` is computed from the string value of the input text, so any change to the string results in a different `input_hash`. And if the same target dataset was used before stopping the server and the original example was already saved in the DB, the dataset would end up containing examples that are not in the current input file.
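For illustration, here's a minimal sketch using Prodigy's `set_hashes` helper (the example texts are made up) showing how an edit to the text produces a different `_input_hash`:
```python
from prodigy import set_hashes

# Hash the same record before and after an edit to "text": the edited version
# gets a different _input_hash, so it no longer matches what was saved in the DB.
original = set_hashes({"text": "What is the capital of France?"})
edited = set_hashes({"text": "What is the capital city of France?"})
print(original["_input_hash"] != edited["_input_hash"])  # True
```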
We can definitely filter these examples out of the final dataset with a Python script, but going forward, a more efficient workflow might be to instruct the annotators to ignore (by hitting the ignore button) any examples that need editing, and then post-process them in a single pass, i.e.:
- once the annotation is done, filter out the ignored examples (see the sketch after this list)
- edit them according to your needs and save them in a separate JSONL file
- set up another annotation session with just these edited examples
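Here's a minimal sketch of that first filtering step, assuming placeholder names `my_dataset` and `to_edit.jsonl`, and reusing the `dataset:` loader syntax from the script further down:
```python
import srsly
from prodigy.components.stream import get_stream

# Pull the annotated examples out of the (placeholder) dataset and keep only
# the ones the annotators ignored, so they can be edited and re-annotated.
stream = get_stream("dataset:my_dataset")
ignored = [eg for eg in stream if eg.get("answer") == "ignore"]
srsly.write_jsonl("to_edit.jsonl", ignored)
```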
On to the remaining questions:
- Is it possible to make the duplicate text appear for the annotation?
Are you asking whether it's possible to make an annotator annotate exactly the same question more than once?
For that, you'd need to set `dedup` to `False` in the `get_stream` function. Currently, your recipe is using the legacy `JSONL` loader; the recommended way to load the source is via the `get_stream` function, and with `dedup=False` it won't remove duplicates from the input stream.
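For example (a sketch; `data.jsonl` is a placeholder for your source file and the rest of the recipe is omitted):
```python
from prodigy.components.stream import get_stream

# Load the source with get_stream and keep duplicate rows by turning
# deduplication off.
stream = get_stream("data.jsonl", dedup=False)
```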
Then, another place where filtering happens is within each session. Whenever a session asks for examples, the candidate examples from the stream will be checked against the examples already annotated by this session, based on the input or task hash (depending on the `exclude_by` config setting).
So in your case, if the examples are totally identical, i.e. they would have the same input hash and the same task hash, you would need to differentiate between them using a custom hashing function that takes into account some attribute that distinguishes the examples, e.g. a custom field in the input file such as `"copy": 1`. This new field should be used together with `text` to compute the input hash.
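As a rough sketch, assuming you add a `"copy"` field to each line of your input JSONL, you could recompute the hashes with Prodigy's `set_hashes` helper so that this field becomes part of the input hash:
```python
from prodigy import set_hashes

def add_copy_aware_hashes(stream):
    # Recompute the hashes so the (hypothetical) "copy" field is included in
    # the input hash: two otherwise identical questions with different "copy"
    # values then get different hashes and are both sent out for annotation.
    for eg in stream:
        yield set_hashes(eg, input_keys=("text", "copy"), overwrite=True)
```
In the recipe, you could then apply it with `stream.apply(add_copy_aware_hashes, stream=stream)`, mirroring the pattern used in the script below.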
If, however, by duplicates you mean the same input text but a different question about it, e.g. different options in the `choice` block, then you can set `"exclude_by": "task"` in your `.prodigy.json`. In fact, this is the default setting, so you shouldn't have to change anything unless it's set to `"input"` in your current setup.
I wonder what's the purpose of sending duplicates - is it computing intra-annotator agreement as opposed to (or on top of) inter-annotator agreement?
- How do we remove the extra annotations from the dataset (database)
You'd need to filter your current target dataset (both the main and the per-session datasets) and remove the examples that are not present in the current input file. Here's an example script that could do that. It saves copies of the edited datasets as new JSONL files in the current working directory:
```python
import sys
from typing import List, Set

import srsly
from prodigy.components.db import connect
from prodigy.components.stream import get_stream
from prodigy.types import StreamType


def filter_out_examples(stream: StreamType, hashes: Set[int]) -> StreamType:
    """Filter out examples with specific hashes from the stream."""
    for example in stream:
        if example["_input_hash"] not in hashes:
            yield example


def get_stream_hashes(db, dataset_name: str) -> List[int]:
    """Get the input hashes stored for a dataset."""
    return db.get_input_hashes(dataset_name)


def save_dataset(stream: StreamType, dataset_name: str) -> None:
    """Save a dataset stream to a JSONL file."""
    output_filename = f"{dataset_name}_edited.jsonl"
    srsly.write_jsonl(output_filename, stream)
    print(f"Saved an updated copy of {dataset_name} as {output_filename}")


def get_extra_hashes(
    input_hashes: List[int], session_hashes: List[List[int]]
) -> Set[int]:
    """Get hashes that are in the session data but not in the input."""
    extra_hashes = set()
    for session_hash_list in session_hashes:
        extra_hashes.update(set(session_hash_list) - set(input_hashes))
    return extra_hashes


def clean_datasets(
    input_filename: str,
    main_dataset: str,
    annotator1_dataset: str,
    annotator2_dataset: str,
) -> None:
    """Clean the datasets by removing extra examples."""
    print("Processing datasets:")
    print(f"Input file: {input_filename}")
    print(f"Main dataset: {main_dataset}")
    print(f"Annotator 1 dataset: {annotator1_dataset}")
    print(f"Annotator 2 dataset: {annotator2_dataset}")

    # Connect to the database
    db = connect()

    # Get the input stream and its hashes
    input_stream = get_stream(input_filename, dedup=False)
    input_hashes = [ex["_input_hash"] for ex in input_stream]

    # Get the dataset streams
    streams = {
        "main": get_stream(f"dataset:{main_dataset}"),
        "annotator1": get_stream(f"dataset:{annotator1_dataset}"),
        "annotator2": get_stream(f"dataset:{annotator2_dataset}"),
    }

    # Get the session hashes
    session_hashes = [
        get_stream_hashes(db, annotator1_dataset),
        get_stream_hashes(db, annotator2_dataset),
    ]

    # Find the extra examples
    extra_hashes = get_extra_hashes(input_hashes, session_hashes)
    if extra_hashes:
        print(f"Found {len(extra_hashes)} examples to filter out.")
    else:
        print("No examples to filter out.")
        sys.exit(0)

    # Filter and save the datasets
    dataset_names = {
        "main": main_dataset,
        "annotator1": annotator1_dataset,
        "annotator2": annotator2_dataset,
    }
    for key, stream in streams.items():
        stream.apply(filter_out_examples, stream=stream, hashes=extra_hashes)
        save_dataset(stream, dataset_names[key])


def main():
    clean_datasets(
        input_filename="input.jsonl",  # replace with your input filename
        main_dataset="main_dataset",  # replace with your target dataset
        annotator1_dataset="main_dataset-annotator1",  # replace with annotator 1's session dataset
        annotator2_dataset="main_dataset-annotator2",  # replace with annotator 2's session dataset
    )


if __name__ == "__main__":
    main()
```
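Once you've checked the edited JSONL files, one way to get the cleaned copies back into Prodigy (dataset names here are just the placeholders from the script) would be to drop the old datasets and re-import the edited files, e.g. `prodigy drop main_dataset` followed by `prodigy db-in main_dataset main_dataset_edited.jsonl`, and the same for each session dataset.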