I'm annotating a textcat model for text originating from PDFs. The visual format of each document varies widely, which makes it much easier to annotate from the page image than from the plain text. I'd therefore like to see each page as an image while having the text of that page saved in the database.
I'm a little stuck on how to implement this, so please let me know if anyone has any suggestions.
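One minimal way to get both into the same record is to build a JSONL source where each task carries an `"image"` field (what the annotator sees) and a `"text"` field (what you want stored). This is only a sketch under that assumption; the file names and the `write_source` helper are hypothetical, not part of Prodigy's API:

```python
import json

# Hypothetical layout: one rendered PNG per PDF page, plus its extracted text.
pages = [
    {"image": "pages/doc1_p1.png", "text": "Extracted text of page 1..."},
    {"image": "pages/doc1_p2.png", "text": "Extracted text of page 2..."},
]

def write_source(pages, out_path="pdf_pages.jsonl"):
    """Write a JSONL source where each task carries both fields:
    "image" is displayed in the UI, while "text" travels with the
    task and is saved to the database alongside the annotation."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, page in enumerate(pages, start=1):
            task = {**page, "meta": {"page": i}}  # meta shows up in the UI corner
            f.write(json.dumps(task) + "\n")
```

Since Prodigy saves the whole task dict with the annotation, anything you put in the task (like the extracted text) ends up in the database even if the interface only renders the image.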
Have you seen my colleague @ljvmiranda921's recent blog post, where he built a PDF processing workflow with Prodigy?
There is also an accompanying GitHub repo with a spaCy project and custom Prodigy recipes.
I have only dabbled with the project, so I don't know all of the details, but it has been very popular with many Prodigy users. What's cool about the project is that it also uses LayoutLMv3 (available through Hugging Face), which combines both text and image masking, and fine-tunes that model. The project uses the FUNSD dataset, so to adapt it you'll likely need to learn how that dataset is structured and mimic it for your own data.
While this may not be a perfect solution, hopefully it gives you a concrete idea of an approach. As you've probably seen, for PDFs we typically recommend either OCRing the text and using that text in Prodigy, or treating the pages as images.
Hope this helps and definitely keep us informed on whatever direction you go!
The blog post is super cool. I've been interested in getting into the layout models, but I'm trying to get a better understanding of Prodigy's workflow before I delve into more complicated projects. I managed to get the images to appear by adding the fetch_images wrapper to my stream.
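For anyone following along: fetch_images essentially converts local paths (or URLs) in a task's "image" field into base64 data URIs so the browser can render them inline. A rough stdlib-only approximation for local files, just to illustrate the idea (Prodigy's real preprocessor handles more cases):

```python
import base64
import mimetypes
from pathlib import Path

def fetch_images_approx(stream):
    """Rough approximation of the fetch_images preprocessor for local
    files: replace an "image" path with a base64 data URI so the
    browser can display it. Tasks whose path doesn't resolve to a
    file are passed through unchanged."""
    for task in stream:
        path = Path(task["image"])
        if path.is_file():
            mimetype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
            task = {**task, "image": f"data:{mimetype};base64,{encoded}"}
        yield task
```

One thing to be aware of: encoding images this way stores the full base64 string with every saved annotation, which can make the database large for high-resolution page renders.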