prodigy-ocr.correct ingesting to layoutLM

eemon2 · November 25, 2023, 9:44pm

Hey everyone!
Has anyone used the Prodigy-PDF extension to feed annotations directly into a LayoutLM model?
I am finding it hard to use the annotated JSON as a training dataset, because its structure is very different from that of the FUNSD dataset.

Can anyone help if they have experience on it?

koaning · November 27, 2023, 9:05am

Hi there!

I briefly had a look at LayoutParser when exploring the Prodigy-PDF plugin and noticed that the pretrained models weren't great for the arXiv papers that I was dealing with. This kept me from working on an ocr.correct recipe.

However, I wasn't aware of LayoutLM, so that's something that might be worth diving into. It even seems to be available in Hugging Face transformers. I don't have experience with it though, so I'll need to properly kick the tires to see if it's something that we might recommend.

One thing about the annotated JSON: it should be relatively straightforward to write a Python script that converts this data into any other format. If there is a blocker, I'd like to understand why. What's so specific about the data format that LayoutLM requires?
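As a rough illustration of what such a conversion script could look like, here is a minimal sketch. It assumes each annotated example carries the page `width` and `height` plus a list of `spans`, where every span has `points` (box corners in pixels), a `label`, and the corrected `text`. Those field names are assumptions, so check them against your own exported JSONL. The output mimics the FUNSD layout (a `form` list with `box`, `text`, `label`, `words`, `linking`), and boxes are scaled to the 0-1000 range that LayoutLM expects. Because no word-level boxes are assumed to be available, every word simply reuses the box of its span, which is only an approximation.

```python
import json


def normalize_box(points, page_width, page_height):
    """Turn a list of [x, y] corner points into [x0, y0, x1, y1] on a 0-1000 scale."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


def prodigy_to_funsd(example):
    """Convert one annotated example (assumed field names) into a FUNSD-style dict."""
    width, height = example["width"], example["height"]
    form = []
    for i, span in enumerate(example.get("spans", [])):
        box = normalize_box(span["points"], width, height)
        text = span.get("text", "")
        form.append({
            "id": i,
            "text": text,
            "box": box,
            "label": span.get("label", "other").lower(),
            # Approximation: no word-level boxes, so each word reuses the span box.
            "words": [{"text": word, "box": box} for word in text.split()],
            "linking": [],
        })
    return {"form": form}


# Hypothetical example record, written by hand for illustration only.
example = {
    "width": 200,
    "height": 100,
    "spans": [
        {
            "points": [[20, 10], [120, 10], [120, 30], [20, 30]],
            "label": "HEADER",
            "text": "Invoice number",
        }
    ],
}

converted = prodigy_to_funsd(example)
print(json.dumps(converted, indent=2))
```

To run this over a whole export, you could read the JSONL file line by line, call `json.loads` on each line, and pass the result to `prodigy_to_funsd`.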
