PDF OCR Image annotation metadata - feature suggestion?

PaulBFB · May 7, 2024, 1:54pm

TL;DR: I'm trying to fine-tune LayoutLMv3 on my own PDFs based on this structure. LayoutLMv3 expects training data in a pretty specific format, which doesn't really match the output of the pdf.ocr.correct recipe, so I added the "metadata" back in, which seems fairly janky.

A similar question was asked here before for pretty much this use case, but whoever asked it never followed up.

So the custom recipe in the tutorial is great and gets me like 90% of where I want to go. The recipe uses the FUNSD dataset to fine-tune, which contains images together with their annotations. The custom recipe, however, uses the prodigy-pdf plugin to first annotate the PDFs with bounding boxes and then apply OCR to the boxes afterwards.

The issue I ran into was that when you create annotations with pdf.image.manual and then use pdf.ocr.correct to apply OCR and correct it, the data that Prodigy stores from these annotations looks like this:

{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715088990,"_annotator_id":"2024-05-07_15-36-22","_session_id":"2024-05-07_15-36-22"}

So it was missing any sort of identifier I could use to connect the annotation back to the PDF file or its image.
Importantly, LayoutLMv3 expects training data like this (excerpt):

    {
            "box": [
                166,
                182,
                202,
                196
            ],
            "text": "B164",
            "label": "answer",
            "words": [
                {
                    "box": [
                        166,
                        182,
                        202,
                        196
                    ],
                    "text": "B164"
                }
            ],
            "linking": [
                [
                    25,
                    0
                ]
            ],
            "id": 0
        },

Importantly, the boxes need to be connected to the text that's inside them.
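For illustration, here is a minimal sketch of how one Prodigy record could be mapped onto an entry of that shape. The helper names `rect_to_box` and `to_funsd_entry` are made up for this example, and it assumes the whole field is treated as a single "word" and that boxes stay in pixel coordinates (depending on your preprocessing, LayoutLMv3 may also need them normalized to the page size).

```python
def rect_to_box(annot):
    """Convert a Prodigy rect (x, y, width, height) to [x0, y0, x1, y1]."""
    x0, y0 = annot["x"], annot["y"]
    x1, y1 = x0 + annot["width"], y0 + annot["height"]
    # FUNSD-style boxes are integers, so round the pixel coordinates.
    return [round(v) for v in (x0, y0, x1, y1)]


def to_funsd_entry(annot, entry_id):
    """Build a FUNSD-like entry from one annotated rect (hypothetical helper)."""
    # Prefer the corrected transcription, fall back to the raw OCR text.
    text = annot.get("transcription", annot.get("text", "")).strip()
    box = rect_to_box(annot)
    return {
        "box": box,
        "text": text,
        "label": annot["label"],
        # Assumption: the whole field is kept as a single word.
        "words": [{"box": box, "text": text}],
        "linking": [],
        "id": entry_id,
    }


example = {
    "label": "date",
    "x": 421.9, "y": 120.7, "width": 59, "height": 16,
    "text": "30.09.2023\n\f",
    "transcription": "30.09.2023\n\f",
}
entry = to_funsd_entry(example, 0)
print(entry["box"], entry["text"])
```

Splitting the text into individual words with their own boxes would need word-level OCR output, which this sketch does not attempt.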

So the steps I needed:

1. Draw boundaries on the PDF files with pdf.image.manual
2. Save the boundaries
3. Apply OCR inside the boundaries and correct it with pdf.ocr.correct
4. Connect the data from both steps, ideally with an ID
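The last step boils down to carrying the page-level fields over to every box. A rough sketch of the idea, not the actual prodigy-pdf code: the function name `spans_with_meta` is hypothetical, and it assumes each page-level example has a "spans" list plus a "meta" dict holding the PDF path and page.

```python
def spans_with_meta(example):
    """Yield one task per drawn box, carrying over the page-level metadata."""
    for span in example.get("spans", []):
        annot = dict(span)  # copy so the original example is left untouched
        # Assumption: "meta" holds the reference to the source PDF and page.
        annot["meta"] = example.get("meta", {})
        yield annot


page_example = {
    "meta": {"page": 0, "path": "data/115000601062.pdf"},
    "spans": [
        {"label": "date", "x": 421.9, "y": 120.7, "width": 59, "height": 16},
    ],
}
tasks = list(spans_with_meta(page_example))
print(tasks[0]["meta"])
```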

I did solve it with some difficulty, since pdf.ocr.correct already uses custom loaders. Ultimately I simply copied the pdf.ocr.correct recipe to a new file, added a single line

annot["meta"] = ex["meta"]

and used the recipe with prodigy pdf.ocr.custom PDF-ocr dataset:PDF --labels $labels -F ./pdf_custom.py. Looking at the data, it now contains the metadata, which in turn has the path to the original PDF file.

{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","meta":{"page":0,"path":"data/115000601062.pdf"},"field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715087504,"_annotator_id":"2024-05-07_15-11-30","_session_id":"2024-05-07_15-11-30"}
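With the metadata in place, the records can be grouped back into pages outside of Prodigy. A small sketch of that, assuming the dataset was exported as JSONL (one record per line) and that only accepted answers should be kept; `group_by_page` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def group_by_page(jsonl_lines):
    """Group accepted annotations by (PDF path, page number)."""
    pages = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        annot = json.loads(line)
        if annot.get("answer") != "accept":
            continue  # skip rejected or ignored tasks
        meta = annot.get("meta", {})
        pages[(meta.get("path"), meta.get("page"))].append(annot)
    return dict(pages)


sample_lines = [
    '{"label": "date", "text": "30.09.2023", "answer": "accept", '
    '"meta": {"page": 0, "path": "data/115000601062.pdf"}}',
    '{"label": "total", "text": "12,50", "answer": "reject", '
    '"meta": {"page": 0, "path": "data/115000601062.pdf"}}',
]
grouped = group_by_page(sample_lines)
print(list(grouped))
```

In practice the lines would come from an open file handle on the exported dataset rather than an inline list.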

Is there a better way to do this?
Importantly, if this is the preferred way, it would be great if the metadata were included by default in the pdf.ocr.correct recipe. It seems like I'm not the only one who was looking for this.

Thanks! If there's a better way or I missed it in the documentation, I'd appreciate the pointers.

magdaaniol · May 9, 2024, 7:37am

Welcome to the forum @PaulBFB,

Thanks for sharing the ample context for the issue. Indeed, looking at it, there's no reason not to propagate the reference to the document in pdf.ocr.correct output. I imagine it should always be required for training.
If I understand correctly, that's the only missing piece of information to be able to translate the Prodigy output to the format required by LayoutLMv3?
The formatting to particular formats will naturally have to be done by scripts outside Prodigy, but it's obvious we should be providing all the information in the output to make it easier. Thanks again for the suggestion!

PaulBFB · May 10, 2024, 7:56am

Hi Magda!

In short, yes, you're exactly right. All that's missing is the reference to the document. Ideally I'd create a PR for that single line in prodigy-pdf. Should I just do that? I haven't seen anything on contributing (admittedly, I've only given it a cursory glance).

As for the formatting, again, you're right: that needs to be done outside of the recipe anyway, which is what I ended up doing.

Just wanted to point out that in case of PDF files, I think most people will need the reference anyway.

Thanks for your answer.

magdaaniol · May 13, 2024, 9:59am

Hi @PaulBFB ,

Thanks for the confirmation, in that case, if you're up for it, we'd really appreciate a PR! You should be able to just fork and open a PR (we haven't got round to prepping the contribution instructions yet - sorry!)
