I'm annotating a textcat model for text originating from PDFs. The visual format of each document varies widely, which makes it much easier to annotate from the page image than from the plain text. I'd therefore like to see each page as an image while having the text of that page saved in the database.
I'm a little stuck on how to implement this, so please let me know if anyone has any suggestions.
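One minimal way to get both into the same record is to build a JSONL source where each task carries an `"image"` field (what the annotator sees) and a `"text"` field (what you want stored). This is only a sketch under that assumption; the file names and the `write_source` helper are hypothetical, not part of Prodigy's API:

```python
import json

# Hypothetical layout: one rendered PNG per PDF page, plus its extracted text.
pages = [
    {"image": "pages/doc1_p1.png", "text": "Extracted text of page 1..."},
    {"image": "pages/doc1_p2.png", "text": "Extracted text of page 2..."},
]

def write_source(pages, out_path="pdf_pages.jsonl"):
    """Write a JSONL source where each task carries both fields:
    "image" is displayed in the UI, while "text" travels with the
    task and is saved to the database alongside the annotation."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, page in enumerate(pages, start=1):
            task = {**page, "meta": {"page": i}}  # meta shows up in the UI corner
            f.write(json.dumps(task) + "\n")
```

Since Prodigy saves the whole task dict with the annotation, anything you put in the task (like the extracted text) ends up in the database even if the interface only renders the image.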
Have you seen my colleague @ljvmiranda921's recent blog post, where he built a PDF processing workflow with Prodigy?
There is also an accompanying GitHub repo with a spaCy project and custom Prodigy recipes.
I have only dabbled with the project, so I don't know all of the details, but it has been very popular with many Prodigy users. What's cool about the project is that it also uses LayoutLMv3 (available through Hugging Face), which combines both text and image masking, and fine-tunes that model. The project uses the FUNSD dataset, so to adapt it you'll likely need to learn how that dataset is structured and mimic it for your own data.
While this may not be a perfect solution, hopefully it gives you a concrete idea of an approach. As you've probably seen, for PDFs we typically recommend either OCRing the text and using that text in Prodigy, or treating the pages as images.
Hope this helps and definitely keep us informed on whatever direction you go!
The blog post is super cool. I've been interested in getting into the layout models, but I'm trying to get a better understanding of Prodigy's workflow before I delve into more complicated projects. I managed to get the images to appear by adding the fetch_images wrapper to my stream.
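For anyone following along: fetch_images essentially converts local paths (or URLs) in a task's "image" field into base64 data URIs so the browser can render them inline. A rough stdlib-only approximation for local files, just to illustrate the idea (Prodigy's real preprocessor handles more cases):

```python
import base64
import mimetypes
from pathlib import Path

def fetch_images_approx(stream):
    """Rough approximation of the fetch_images preprocessor for local
    files: replace an "image" path with a base64 data URI so the
    browser can display it. Tasks whose path doesn't resolve to a
    file are passed through unchanged."""
    for task in stream:
        path = Path(task["image"])
        if path.is_file():
            mimetype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
            task = {**task, "image": f"data:{mimetype};base64,{encoded}"}
        yield task
```

One thing to be aware of: encoding images this way stores the full base64 string with every saved annotation, which can make the database large for high-resolution page renders.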