Prepopulating the image.manual .jsonl data

I have PDFs, many of which are identical, and I am looking for a way to speed up annotation.
I am interested in knowing more about the idea of “pre-populating them in the data”
As mentioned in

Let's say I have 50 PDFs with an identical layout.
Is there a workflow by which I could annotate one of my identical PDFs, save it as .jsonl,
and then put the same bounding boxes and labels on PDFs 2-50?

Hi @alphie,

The easiest way to "prepopulate" your input jsonl with bounding boxes from the "reference" annotation would be via a Python script outside Prodigy.
This script should read your "reference" example and copy the value of the spans key to all the other input examples.
spans is where the bounding boxes are stored. Please note that the bounding boxes are defined as pixel offsets relative to the image height and width, so if the PDFs have different heights and widths, the bounding boxes will not fall in exactly the same spots.
The script could look something like this:

from typing import Dict, List

import srsly
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
from wasabi import msg

def add_spans(stream: StreamType, spans: List[Dict]) -> StreamType:
    for example in stream:
        if "spans" in example:
            # Keep existing boxes and warn instead of overwriting them
            msg.warn(
                f"Example with _task_hash {example.get('_task_hash')} already contains bounding boxes. Leaving as is."
            )
        else:
            example["spans"] = spans
        yield example

# Load the reference annotation
reference = next(srsly.read_jsonl("template_annotation.jsonl"), None)
if reference is None or "spans" not in reference:
    msg.error("The reference annotation does not contain `spans`", exits=1)

reference_spans = reference["spans"]

# Get the stream and apply spans
stream = get_stream("input_data.jsonl")
stream.apply(add_spans, stream=stream, spans=reference_spans)

# Write the annotated stream to a file
output_path = "./preannotated_data.jsonl"
srsly.write_jsonl(output_path, list(stream))
msg.good(f"Preannotated dataset saved at {output_path}")

The preannotated jsonl (preannotated_data.jsonl) should be ready to use with image.manual for curation.
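If some of the PDFs turn out to have different page dimensions, the pixel-offset caveat above means the copied boxes would need rescaling first. Below is a minimal sketch of one way to do that, not something built into Prodigy. The helper name scale_spans is hypothetical, and the span keys it touches ("points", plus optional "x", "y", "width", "height" and "center") are assumptions about what your exported annotations contain, so please check them against your own reference example. It also assumes you know the pixel size of both the reference image and the target image.

```python
import copy
from typing import Dict, List, Tuple


def scale_spans(
    spans: List[Dict], src_size: Tuple[float, float], dst_size: Tuple[float, float]
) -> List[Dict]:
    """Return a rescaled deep copy of `spans`. Sizes are (width, height) in pixels."""
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    scaled = copy.deepcopy(spans)  # never mutate the reference spans
    for span in scaled:
        # Corner coordinates stored as [x, y] pairs (assumed key: "points")
        if "points" in span:
            span["points"] = [[x * sx, y * sy] for x, y in span["points"]]
        # Optional rectangle fields (assumed keys), scaled only when present
        for key, factor in (("x", sx), ("width", sx), ("y", sy), ("height", sy)):
            if key in span:
                span[key] = span[key] * factor
        if "center" in span:
            cx, cy = span["center"]
            span["center"] = [cx * sx, cy * sy]
    return scaled


# Made-up reference box on a 200x400 image, rescaled for a 400x200 image
reference_spans = [
    {"label": "HEADER", "points": [[10, 20], [10, 60], [110, 60], [110, 20]]}
]
scaled = scale_spans(reference_spans, src_size=(200, 400), dst_size=(400, 200))
print(scaled[0]["points"])
```

Because the function returns a deep copy, every example gets its own list of spans, which also avoids several examples sharing (and accidentally modifying) the same list object.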