PDF OCR Image annotation metadata - feature suggestion?

TL;DR - I'm trying to fine-tune LayoutLMv3 on my own PDFs following this structure. LayoutLMv3 expects training data in a fairly specific format, which the output of the pdf.ocr.correct recipe doesn't match, so I added the "metadata" back in myself, which feels fairly janky.

A similar question about pretty much this use case was asked here before, but whoever asked it never followed up.

So the custom recipe in the tutorial is great and gets me about 90% of the way there. The recipe uses the FUNSD dataset for fine-tuning, which contains images together with their annotations. The custom recipe, however, uses the prodigy-pdf plugin to first annotate the PDFs with bounding boxes and then apply OCR to the boxes afterwards.

The issue I ran into was that when you create annotations with pdf.image.manual and then use pdf.ocr.correct to apply OCR and correct it, the data that Prodigy stores from these annotations looks like this:

{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715088990,"_annotator_id":"2024-05-07_15-36-22","_session_id":"2024-05-07_15-36-22"}

So it was missing any sort of identifier I could use to connect the annotation back to the PDF file or its image.
Importantly, LayoutLMv3 expects training data like this (excerpt):

    {
        "box": [
            166,
            182,
            202,
            196
        ],
        "text": "B164",
        "label": "answer",
        "words": [
            {
                "box": [
                    166,
                    182,
                    202,
                    196
                ],
                "text": "B164"
            }
        ],
        "linking": [
            [
                25,
                0
            ]
        ],
        "id": 0
    },

In other words, the boxes need to be connected to the text that's inside them.
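
For reference, a minimal sketch of how a Prodigy rect annotation could be mapped to the [x0, y0, x1, y1] box layout shown above. It assumes that "x" and "y" are the top-left corner in pixels and that the page image size is known (it is not part of the record shown earlier). The helper names rect_to_box and normalize_box are made up for illustration. LayoutLMv3 works with boxes scaled to a 0-1000 range, so the second helper covers that step.

```python
def rect_to_box(rect):
    """Convert a Prodigy rect (x, y, width, height) to [x0, y0, x1, y1]."""
    x0, y0 = rect["x"], rect["y"]
    return [x0, y0, x0 + rect["width"], y0 + rect["height"]]


def normalize_box(box, page_width, page_height):
    """Scale pixel coordinates to integers in the 0-1000 range.

    page_width and page_height are assumptions: they have to come from
    the page image, since the annotation record does not carry them.
    """
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Values taken from the annotation shown earlier in this post
rect = {"x": 421.9, "y": 120.7, "width": 59, "height": 16}
box = rect_to_box(rect)
```

The result matches the first and third entries of the "points" list in the Prodigy record, which is a quick sanity check that the corner convention is right.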

So the steps I needed were:

  • Draw bounding boxes on the PDF files with pdf.image.manual
  • Save the boxes
  • Apply OCR inside the boxes and correct the result with pdf.ocr.correct
  • Connect the data from both steps, ideally with an ID

I did solve it, with some difficulty, since pdf.ocr.correct already uses custom loaders. Ultimately I simply copied the pdf.ocr.correct recipe to a new file, added a single line

annot["meta"] = ex["meta"]

and ran the recipe with prodigy pdf.ocr.custom PDF-ocr dataset:PDF --labels $labels -F ./pdf_custom.py. Looking at the data, it now contains the metadata, which in turn has the path to the original PDF file.

{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","meta":{"page":0,"path":"data/115000601062.pdf"},"field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715087504,"_annotator_id":"2024-05-07_15-11-30","_session_id":"2024-05-07_15-11-30"}
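
With "meta" present, the annotations can be grouped per page and reshaped outside Prodigy. The following is only a sketch under a few assumptions: each box is treated as a single "word", "linking" is left empty because the Prodigy output has no relation information, and coordinates are rounded to integers without any rescaling. The function names group_by_page and to_funsd_like are hypothetical.

```python
from collections import defaultdict


def group_by_page(records):
    """Group accepted annotations by (PDF path, page) taken from "meta"."""
    pages = defaultdict(list)
    for rec in records:
        if rec.get("answer") != "accept":
            continue
        meta = rec["meta"]
        pages[(meta["path"], meta["page"])].append(rec)
    return pages


def to_funsd_like(records):
    """Build FUNSD-style entries for the annotations of one page."""
    entries = []
    for idx, rec in enumerate(records):
        box = [rec["x"], rec["y"],
               rec["x"] + rec["width"], rec["y"] + rec["height"]]
        box = [round(v) for v in box]
        # strip() removes the trailing newline and form feed left by OCR
        text = rec["transcription"].strip()
        entries.append({
            "box": box,
            "text": text,
            "label": rec["label"],
            "words": [{"box": box, "text": text}],  # assumption: one word per box
            "linking": [],  # assumption: no relations available
            "id": idx,
        })
    return entries


# Trimmed-down version of the record shown above
example = {
    "label": "date", "x": 421.9, "y": 120.7, "height": 16, "width": 59,
    "transcription": "30.09.2023\n\f", "answer": "accept",
    "meta": {"page": 0, "path": "data/115000601062.pdf"},
}
pages = group_by_page([example])
entries = to_funsd_like(pages[("data/115000601062.pdf", 0)])
```

Grouping on the (path, page) pair rather than on the path alone keeps multi-page PDFs from being merged into a single training example.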

Is there a better way to do this?
If this is the preferred way, it would be great if the metadata were included by default in the pdf.ocr.correct recipe. It seems I'm not the only one who was looking for this.

Thanks! If there's a better way or I missed it in the documentation, I'd appreciate any pointers.

Welcome to the forum @PaulBFB,

Thanks for sharing the ample context for the issue. Indeed, looking at it, there's no reason not to propagate the reference to the document in pdf.ocr.correct output. I imagine it should always be required for training.
If I understand correctly, that's the only missing piece of information to be able to translate the Prodigy output to the format required by LayoutLMv3?
Converting to particular formats will naturally have to be done by scripts outside Prodigy, but we should clearly be providing all the information in the output to make that easier. Thanks again for the suggestion!
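
As a rough illustration of what the first step of such an external script might look like, here is a sketch that reads exported annotations in JSONL form (for example from prodigy db-out) and keeps only the accepted ones. The function name load_accepted is made up, and dropping the "image" field is just an assumption that the base64 payload is not needed downstream.

```python
import json


def load_accepted(lines):
    """Parse JSONL lines and keep accepted annotations without the image data."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        if rec.get("answer") != "accept":
            continue
        rec.pop("image", None)  # assumption: base64 image not needed here
        records.append(rec)
    return records


# Small in-memory stand-in for an exported JSONL file
sample_lines = [
    '{"label": "date", "answer": "accept", "image": "data:image/png;base64,X"}',
    "",
    '{"label": "date", "answer": "reject"}',
]
accepted = load_accepted(sample_lines)
```

Because the function takes any iterable of lines, an open file handle can be passed in directly instead of the in-memory list.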

Hi Magda!

In short, yes, you're exactly right. All that's missing is the reference to the document. Ideally I'd create a PR for that single line in prodigy-pdf. Should I just do that? I haven't seen anything on contributing (admittedly, I've only given it a cursory glance).

As for the formatting, again, you're right that it needs to be done outside of the recipe anyway, which is what I ended up doing.

Just wanted to point out that in case of PDF files, I think most people will need the reference anyway.

Thanks for your answer.

Hi @PaulBFB,

Thanks for the confirmation. In that case, if you're up for it, we'd really appreciate a PR! You should be able to just fork and open a PR (we haven't got round to prepping the contribution instructions yet - sorry!)