Adding a helper image

I'm annotating data for a textcat model, with text originating from PDFs. The visual layout varies a lot from document to document, which makes it much easier to annotate from the page image than from the plain text. So I'd like to see the page as an image while annotating, but have the text of the page saved in the db.

I'm a little stuck on how to implement this. Please let me know if anyone has any suggestions.

hi @jordandavis!

Have you seen my colleague @ljvmiranda921's recent blog post, where he created a PDF processing workflow with Prodigy?

There is also an accompanying GitHub repo with a spaCy project and custom Prodigy recipes.

I have only dabbled with the project so I don't know all of the details, but it has been very popular with many Prodigy users. What's cool about the project is that it also uses Hugging Face's LayoutLMv3, which is pre-trained with both text and image masking, and fine-tunes that model. The project uses the FUNSD dataset, so to adapt it you'll likely need to learn how that dataset is structured and mimic it for your own data.

While this may not be a perfect solution, hopefully it provides a concrete idea of an approach. As you've probably seen, for PDFs we typically recommend either running OCR and using the extracted text in Prodigy, or treating the pages as images.

Hope this helps and definitely keep us informed on whatever direction you go!

We are currently working on an invoice processing problem where we need PDF annotation tools. We plan to use LayoutLM after annotation.

  1. Can we import the PDF and the PDF's OCR file directly into Prodigy?
  2. Do you provide an export option which is compatible with LayoutLM?
  3. Do you provide active learning while doing the annotation?
  4. Do you support the training option for LayoutLM directly?

hi @Khadke_C,

I moved your post into this thread about @ljvmiranda921's post/project because I think my answer yesterday covers a lot of your questions.

Let us know if you have questions!

Thanks Ryan,

The blog post is super cool. I've been interested in getting into the Layout models but am trying to get a better understanding of Prodigy's workflow before I delve into more complicated projects. I managed to get the images to appear by adding the fetch_images wrapper on my stream.
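
In case it helps anyone with the same question, here is a rough sketch of the idea using only the Python standard library. It does not call Prodigy itself: each task carries both a "text" key and an "image" key, with the image inlined as a base64 data URI, which is my understanding of what fetch_images does to file paths. The helper names (encode_image, make_stream, drop_images) are made up for illustration. drop_images shows one way to strip the inline image data before examples are stored, for instance from a recipe's before_db callback, assuming your Prodigy version supports one. How the task is rendered depends on the interface your recipe uses, which isn't covered here.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Tuple


def encode_image(path: Path) -> str:
    """Return the image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path.name)
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    # Fall back to PNG if the MIME type can't be guessed from the file name.
    return f"data:{mime or 'image/png'};base64,{data}"


def make_stream(pages: Iterable[Tuple[str, Path]]) -> Iterator[Dict[str, str]]:
    """Yield one task per page, holding both the page text and the page image."""
    for text, image_path in pages:
        yield {"text": text, "image": encode_image(image_path)}


def drop_images(examples: List[dict]) -> List[dict]:
    """Return copies of the examples without inline (data URI) images,
    so only the text and the annotations end up being stored."""
    return [
        {
            key: value
            for key, value in eg.items()
            if not (key == "image" and str(value).startswith("data:"))
        }
        for eg in examples
    ]
```

With (text, image path) pairs coming from your own PDF extraction step, make_stream gives you something to use as the recipe's stream, and drop_images keeps the db small by leaving out the encoded page images.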