Ingesting prodigy-pdf ocr.correct output into LayoutLM

Hey everyone!
Has anyone used the Prodigy-PDF plugin to feed annotations directly into a LayoutLM model?
I'm finding it hard to actually use the annotated JSON as a training dataset, since its structure is very different from the FUNSD dataset.

Can anyone who has experience with this help?

Hi there!

I briefly had a look at LayoutParser when exploring the Prodigy-PDF plugin and noticed that the pretrained models weren't great for the Arxiv papers that I was dealing with. That kept me from working on an ocr.correct recipe.

However, I wasn't aware of LayoutLM, so that's something that might be worth diving into. It even seems to be available in Hugging Face transformers. I don't have experience with it myself though, so I'll need to properly kick the tires to see if it's something that we might recommend.
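For anyone exploring this: one concrete requirement of the LayoutLM family is that every token comes with a bounding box scaled to a 0-1000 coordinate grid, regardless of the page's pixel size. A minimal sketch of that normalization (the pixel boxes below are made-up example values, not from any real export):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: a word box on a 612x792pt (US letter) page
print(normalize_bbox((61.2, 79.2, 122.4, 158.4), 612, 792))
# → [100, 100, 200, 200]
```

Whatever annotation format you start from, this is the step that has to happen somewhere before the boxes reach the model.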

One thing about the annotated JSON: it should be relatively straightforward to write a Python script that converts this data into any other format. If there is a blocker, I'd like to understand why. What's so specific about the data format that LayoutLM requires?
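To illustrate what such a conversion script could look like: the sketch below assumes a hypothetical export where each example carries spans with their text, a label, and a pixel box as (x, y, width, height). These field names are assumptions for illustration, not the actual prodigy-pdf schema, so adjust the keys to whatever your exported JSON contains. It reshapes each example into a FUNSD-style record with (x0, y0, x1, y1) word boxes:

```python
import json

def prodigy_to_funsd(example):
    """Reshape one (hypothetical) annotation example into a FUNSD-style record.

    Assumes each span has "text", "label", and a pixel box as
    (x, y, width, height) -- check your own export and adjust the keys.
    """
    words = []
    for span in example.get("spans", []):
        x, y, w, h = span["x"], span["y"], span["width"], span["height"]
        words.append({
            "text": span["text"],
            "label": span.get("label", "other"),
            # FUNSD stores corner coordinates (x0, y0, x1, y1)
            "box": [x, y, x + w, y + h],
        })
    return {"id": example.get("id"), "words": words}

example = {
    "id": 1,
    "spans": [
        {"text": "Invoice", "label": "header",
         "x": 10, "y": 12, "width": 80, "height": 14},
    ],
}
print(json.dumps(prodigy_to_funsd(example)))
```

The point is just that it's a plain dict-to-dict reshaping; if something in the real export makes this harder than it looks, that's the part I'd want to hear about.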