Prepopulating the image.manual .jsonl data

alphie · May 27, 2024, 4:06pm

I have many PDFs, a lot of which share an identical layout, and I am looking for a way to speed up annotation.
I am interested in knowing more about the idea of “pre-populating them in the data”
As mentioned in

Let's say I have 50 PDFs with an identical layout.
Is there a workflow by which I could annotate one of them, save it as .jsonl,
and then put the same bounding boxes and labels on PDFs 2-50?

magdaaniol · May 28, 2024, 8:36am

Hi @alphie,

The easiest way to "prepopulate" your input jsonl with bounding boxes from the "reference" annotation would be via a Python script outside Prodigy.
This script should read your "reference" example and copy the value of its spans key to all the other input examples.
spans is where the bounding boxes are stored. Please note that the bounding boxes are defined as pixel offsets relative to the image height and width, so if the PDFs have different heights and widths, the bounding boxes will not fall in exactly the same spots.
The script could look something like this:

from typing import Dict, List

import srsly
from prodigy.components.stream import get_stream
from prodigy.types import StreamType
from wasabi import msg


def add_spans(stream: StreamType, spans: List[Dict]) -> StreamType:
    """Attach the reference spans to every example that has none yet."""
    for example in stream:
        if "spans" in example:
            # Do not overwrite bounding boxes that are already present
            msg.warn(
                f"Example with _task_hash {example.get('_task_hash')} already contains bounding boxes. Leaving as is."
            )
        else:
            example["spans"] = spans
        yield example


# Load the reference annotation
reference = next(srsly.read_jsonl("template_annotation.jsonl"), None)
if reference is None or "spans" not in reference:
    msg.error("The reference annotation does not contain `spans`", exits=1)

reference_spans = reference["spans"]

# Get the stream and apply spans
stream = get_stream("input_data.jsonl")
stream.apply(add_spans, stream=stream, spans=reference_spans)

# Write the annotated stream to a file
output_path = "./preannotated_data.jsonl"
srsly.write_jsonl(output_path, list(stream))
msg.info(f"Preannotated dataset saved at {output_path}")

The preannotated jsonl (preannotated_data.jsonl) should be ready to use with image.manual for curation.
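
If some of the PDFs were rendered at a different pixel size than the reference, the copied boxes could be rescaled before they are attached. The snippet below is only a rough sketch of that idea, not part of Prodigy: scale_spans is a hypothetical helper, and it assumes each span stores its geometry under keys such as points, x, y, width, height and center, and that you know the (width, height) of both the reference page and the target page. Please check the keys against your own template_annotation.jsonl first.

```python
from copy import deepcopy
from typing import Any, Dict, List, Tuple


def scale_spans(
    spans: List[Dict[str, Any]],
    ref_size: Tuple[float, float],
    target_size: Tuple[float, float],
) -> List[Dict[str, Any]]:
    """Return rescaled copies of `spans`; sizes are (width, height) in pixels.

    Hypothetical helper: the span keys handled here are an assumption about
    how the reference annotation stores its bounding boxes.
    """
    sx = target_size[0] / ref_size[0]  # horizontal scale factor
    sy = target_size[1] / ref_size[1]  # vertical scale factor
    scaled = []
    for span in spans:
        new_span = deepcopy(span)  # leave the reference span untouched
        if "points" in new_span:
            new_span["points"] = [[x * sx, y * sy] for x, y in new_span["points"]]
        # Scalar geometry keys, scaled only if they are present
        for key, factor in (("x", sx), ("width", sx), ("y", sy), ("height", sy)):
            if key in new_span:
                new_span[key] = new_span[key] * factor
        if "center" in new_span:
            cx, cy = new_span["center"]
            new_span["center"] = [cx * sx, cy * sy]
        scaled.append(new_span)
    return scaled
```

Inside add_spans you could then assign something like scale_spans(spans, ref_size, page_size) instead of spans, where page_size is looked up per example. Labels and any other non-geometry keys are copied over unchanged.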
