TL;DR - I'm trying to fine-tune LayoutLMv3 on my own PDFs based on this structure. LayoutLMv3 expects training data in a pretty specific format that doesn't really match the output of the pdf.ocr.correct recipe, so I added the "metadata" back in, which seems fairly janky.
A similar question about pretty much this use case was asked here before, but it was never followed up on by the person who asked it.
So the custom recipe in the tutorial is great and gets me about 90% of the way there. The recipe uses the FUNSD dataset for fine-tuning, which contains images together with their annotations. The custom recipe, however, uses the prodigy-pdf plugin to first annotate the PDFs with bounding boxes and then apply OCR to the boxes afterwards.
The issue I ran into was that when you create annotations with pdf.image.manual and then use pdf.ocr.correct to apply OCR and correct it, the data that Prodigy stores from these annotations looks like this:
{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715088990,"_annotator_id":"2024-05-07_15-36-22","_session_id":"2024-05-07_15-36-22"}
So it was missing any sort of identifier I could use to connect the annotation back to the PDF file or its image.
Importantly, LayoutLMv3 expects training data like this (excerpt):
{
"box": [
166,
182,
202,
196
],
"text": "B164",
"label": "answer",
"words": [
{
"box": [
166,
182,
202,
196
],
"text": "B164"
}
],
"linking": [
[
25,
0
]
],
"id": 0
},
Importantly, the boxes need to be connected to the text that's inside them.
So the steps I needed:
- Draw boundaries on the PDF files with pdf.image.manual
- Save the boundaries
- Apply OCR inside the boundaries and correct it with pdf.ocr.correct
- Connect the data from both steps, ideally with an ID
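Once the boxes and their text are connected, each Prodigy annotation still needs to be reshaped into the LayoutLMv3 entry format shown in the excerpt above. A minimal sketch of that conversion (the helper name is my own; I'm assuming the whole box counts as a single "word" since this workflow doesn't produce word-level boxes, and leaving "linking" empty since no key/value links are captured):

```python
def prodigy_box_to_layoutlm(annot, entry_id):
    """Convert one Prodigy rect annotation into a LayoutLMv3-style entry.

    Prodigy stores the rectangle as corner "points" ([x, y] pairs);
    LayoutLMv3 wants a flat [x0, y0, x1, y1] box.
    """
    xs = [p[0] for p in annot["points"]]
    ys = [p[1] for p in annot["points"]]
    box = [int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))]
    text = annot["transcription"].strip()  # drop trailing "\n\f" from OCR
    return {
        "box": box,
        "text": text,
        "label": annot["label"],
        "words": [{"box": box, "text": text}],  # one word-span per box here
        "linking": [],  # no key/value links captured in this workflow
        "id": entry_id,
    }
```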
I did solve it, with some difficulty, since pdf.ocr.correct already uses custom loaders. Ultimately I simply copied the pdf.ocr.correct recipe to a new file and added a single line:
annot["meta"] = ex["meta"]
and used the recipe with prodigy pdf.ocr.custom PDF-ocr dataset:PDF --labels $labels -F ./pdf_custom.py
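For context, the change amounts to copying the page-level meta dict (which holds the PDF path) onto each per-box task before it's sent to the annotator. A simplified, hypothetical sketch of that spot, not the actual plugin code (the real recipe does much more, and `tasks_for_boxes` here is just a stand-in for its own per-box loop):

```python
def tasks_for_boxes(ex):
    """Stand-in for the recipe's per-box task builder (hypothetical)."""
    for span in ex.get("spans", []):
        yield {"label": span["label"], "points": span["points"]}


def add_meta(examples):
    """Yield one task per box, carrying the source-PDF meta along."""
    for ex in examples:
        for annot in tasks_for_boxes(ex):
            annot["meta"] = ex["meta"]  # the single added line
            yield annot
```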
Looking at the data, it now contains the metadata, which in turn has the path to the original PDF file:
{"label":"date","color":"#ff00ff","x":421.9,"y":120.7,"height":16,"width":59,"center":[451.4,128.7],"type":"rect","points":[[421.9,120.7],[421.9,136.7],[480.9,136.7],[480.9,120.7]],"image":"data:image/png;base64,IMAGEDATA","text":"30.09.2023\n\f","transcription":"30.09.2023\n\f","meta":{"page":0,"path":"data/115000601062.pdf"},"field_rows":12,"field_label":"Transcript","field_id":"transcription","field_autofocus":false,"_input_hash":-492730241,"_task_hash":-1248503277,"_view_id":"blocks","answer":"accept","_timestamp":1715087504,"_annotator_id":"2024-05-07_15-11-30","_session_id":"2024-05-07_15-11-30"}
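With the meta restored, grouping the exported annotations back into per-page documents becomes straightforward. A small sketch, assuming the dataset has been exported to JSONL with `prodigy db-out` and that each record carries the `meta` shown above:

```python
import json
from collections import defaultdict


def group_annotations(jsonl_path):
    """Group accepted annotations by (pdf_path, page) using each record's
    restored meta, ready for conversion into LayoutLMv3 training documents."""
    pages = defaultdict(list)
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("answer") != "accept":
                continue  # skip rejected/ignored boxes
            key = (rec["meta"]["path"], rec["meta"]["page"])
            pages[key].append(rec)
    return pages
```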
Is there a better way to do this?
Importantly, if this is the preferred way, it would be great if the metadata were included by default in the pdf.ocr.correct recipe; it seems I'm not the only one who was looking for this.
Thanks! If there's a better way, or I missed it in the documentation, I'd appreciate the pointers.