Use cases for Prodigy-PDF

Can someone suggest some use cases and projects previously done using prodigy-pdf for PDF analysis?
I was looking for a way to extract images and text blocks, while preserving the structure, from a question-answer solution book I have. Is this a suitable tool for this, and how should I do it?

Hey @AKRking! Thanks for reaching out.

The prodigy-pdf plugin is relevant for annotation tasks that involve PDFs. For example, you can use it to label areas of a PDF page (treated as an image) as paragraphs, titles or figures. Or you could use the pdf.ocr.correct recipe to review and correct the output of an OCR algorithm, so that the text extracted from the PDF images is accurate. There is more of a walkthrough in this video here.

As for projects that have used prodigy-pdf for PDF analysis: there is a spaCy project here that uses the plugin. It focuses on annotating PDFs to fine-tune a LayoutLMv3 model using FUNSD, a dataset of noisy scanned documents.

I think the plugin would be suitable for your task of identifying images and text blocks in a question-answer solution book. The built-in pdf.image.manual recipe seems most appropriate here. Provided you have a .pdf of the solution book saved locally on your machine, you could run:

prodigy pdf.image.manual pdfs ./pdfs/ --labels text,image

where pdfs is the name of the dataset to save the annotations to and ./pdfs/ is the local path to the folder containing your question-answer solution book.

The annotation interface will let you mark text areas and image areas in the solution book, so that you can use the labels to train an appropriate model downstream.
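Once you have some annotations, you will probably want to get the labelled regions back out. Here is a minimal sketch of how that could look, assuming you have exported the dataset to JSONL (for example with prodigy db-out pdfs > annotations.jsonl) and that each task has an "answer" key plus a "spans" list whose items carry "label" and "points". Those field names are my assumption about the export format, so check them against one line of your own export first. The helper names are made up for illustration.

```python
import json
from collections import defaultdict


def bounding_box(points):
    """Return (left, top, right, bottom) enclosing a list of [x, y] points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_by_label(jsonl_lines):
    """Group bounding boxes of accepted annotations by label.

    Assumes one JSON task per line, with an "answer" key and a "spans"
    list whose items have "label" and "points" (assumed field names).
    """
    grouped = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # ignore rejected or skipped pages
        for span in task.get("spans", []):
            grouped[span["label"]].append(bounding_box(span["points"]))
    return dict(grouped)


# Tiny hand-written sample standing in for a real export file.
sample = [
    '{"answer": "accept", "spans": [{"label": "text", '
    '"points": [[10, 20], [110, 20], [110, 80], [10, 80]]}]}',
]
for label, boxes in boxes_by_label(sample).items():
    print(label, boxes)
```

With a real export you would pass the open file object instead of the sample list. The resulting boxes are in pixel coordinates of the page image, so you could use them to crop the page with an image library of your choice, or as training targets for a layout model.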