Classifying pages of a PDF

alphie · June 8, 2025, 8:10am

I would like to extract pages with maps from a pdf. The document contains maps of airports and pages of text. Downstream I will be extracting the map boundaries.

I am wondering if there is a method of classifying the pages in the PDF to find the maps.
Something like:
• Use spacy-layout to create a spaCy Doc object from the PDF
• Use the textcat.manual recipe to label each page as map or notmap

The output might be something like
• Page 1, notmap
• Page 2, map

Do you think that would work?

Then, having found the maps, it's only really useful if I can automatically extract the pages with the maps, so the output would be PDF pages with maps which I can then process.
Any thoughts on the workflow for going from a multi-page PDF to a maps-only PDF (or even better, one PDF per map)?

The final step is to extract the map boundaries. If you think there is anything in the spaCy/Prodigy universe that might be worth exploring, please let me know.

Here is an example document.

magdaaniol · June 9, 2025, 9:40am

Hi @alphie,

If I understand correctly, your goal is to process PDFs as images only and there's no need to extract any text.
In that case, you shouldn't need spacy layout at all. spacy layout is mostly useful for extracting text from PDFs.

It sounds like you need an image classification model. You can prepare the training data with the pdf.image.manual recipe, where you can either 1) draw bounding boxes around maps to detect boundaries or 2) set it up to classify entire pages.

For 1) you can use pdf.image.manual as is.
For 2) you'd need to modify the recipe to use the classification view_id. That requires changing the view_id value to "classification" in the return statement on line 113 and setting the label on the example level in the generate_pdf_pages function:

def generate_pdf_pages(pdf_paths: List[Path], split_pages: bool = False):
    """Generate dictionaries that contain an image for each page in the PDF"""
    for pdf_path in pdf_paths:
        pdf = pdfium.PdfDocument(pdf_path)
        n_pages = len(pdf)
        pages = []
        for page_number in range(n_pages):
            pdf_page = pdf.get_page(page_number)
            page = {
                "image": page_to_image(pdf_page),
                "path": str(pdf_path),
                "meta": {
                    "title": pdf_path.name,
                    "page": page_number,
                },
                "label": "MAP",  # hardcoding the label for demonstration
            }
            if split_pages:
                yield set_hashes(page)
            else:
                page["view_id"] = "image_manual"
                pages.append(page)
        if not split_pages:
            yield set_hashes(
                {
                    "pages": pages,
                    "meta": {"title": pdf_path.name},
                    "config": {"view_id": "pages"},
                }
            )
        pdf.close()

In classification mode, you want to use the --split-pages option so that each page of the PDF is sent out as its own task and can be annotated separately.
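With the label hardcoded to "MAP", each page becomes an accept/reject decision. As a rough sketch, the saved annotations could be turned into the "page, map/notmap" list from the original question like this. It assumes the annotations were exported to JSONL, that each record keeps the "meta" fields set in generate_pdf_pages, and that "answer" is one of "accept", "reject" or "ignore". The function names page_labels and read_jsonl are made up for illustration.

```python
import json
from typing import Dict, Iterable, List, Tuple


def page_labels(records: Iterable[Dict]) -> List[Tuple[str, int, str]]:
    """Turn accept/reject decisions into (title, page_index, label) rows."""
    rows = []
    for rec in records:
        answer = rec.get("answer")
        if answer not in ("accept", "reject"):
            continue  # skip ignored or unanswered pages
        meta = rec.get("meta", {})
        # Accepting the hardcoded "MAP" label means the page is a map.
        label = "map" if answer == "accept" else "notmap"
        rows.append((meta.get("title", ""), int(meta["page"]), label))
    # Sort by document title, then by page index.
    return sorted(rows)


def read_jsonl(path: str) -> List[Dict]:
    """Load exported annotations, one JSON object per non-empty line."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Note that "page" is 0-based here because the recipe numbers pages with range(n_pages), so add 1 if you want to print "Page 1", "Page 2" and so on.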

As for the "multipagePDF to PDFmapsOnly" workflow: once you have a successful image classification model, that should just be a question of data post-processing.
I recommend you check out our docs on PDF processing before starting out, to make sure you're familiar with all the options available.
The bottom line is that you need a computer vision solution rather than an NLP solution, so while you can definitely prepare your data using Prodigy, you should be looking at dedicated frameworks such as PyTorch or TensorFlow with Keras for the training phase.
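To give an idea of what that post-processing might look like, here is a minimal sketch that writes one single-page PDF per map page. It assumes you already have a dict mapping 0-based page indices to "map"/"notmap" (from your model or from annotations) and that pypdfium2 version 4 or later is installed. The function names and the output file naming are made up for illustration, so treat this as a starting point rather than a tested pipeline.

```python
from pathlib import Path
from typing import Dict, List


def map_page_indices(labels: Dict[int, str], target: str = "map") -> List[int]:
    """Return the sorted 0-based indices of pages labelled as maps."""
    return sorted(i for i, label in labels.items() if label == target)


def write_map_pdfs(pdf_path: Path, indices: List[int], out_dir: Path) -> List[Path]:
    """Write one single-page PDF per map page and return the new paths."""
    import pypdfium2 as pdfium  # third-party; the same library the recipe uses

    out_dir.mkdir(parents=True, exist_ok=True)
    src = pdfium.PdfDocument(str(pdf_path))
    written = []
    for i in indices:
        dst = pdfium.PdfDocument.new()
        dst.import_pages(src, pages=[i])  # copy just this page
        out_path = out_dir / f"{pdf_path.stem}_page{i + 1}.pdf"
        dst.save(str(out_path))
        dst.close()
        written.append(out_path)
    src.close()
    return written
```

For a single maps-only PDF instead, you could import all indices into one new document with a single import_pages call and save once.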

alphie · June 9, 2025, 2:11pm

Thank you, that's really helpful
