finding areas on pdfs for downstream training

I am working with PDFs. I know how to extract text from PDFs, train an NER or span model, and use them to extract entities or spans.
I now want to train a model to recognise areas of my PDF for downstream processing. The downstream processing will vary.
I know how to annotate a PDF. I have gone through the ljvmiranda approach.

For me, that worked reasonably well for finding words but did not identify areas at all (even though his example does show areas being found).
I have also used pdf.ocr.correct – so once I have found the area, I can OCR the text.

I am struggling with the workflow to train a model to identify different areas of a PDF, e.g. Header, Footer, Table, and then identify them on new PDFs.
Using the ljvmiranda approach, the bounding boxes were all tight around the words: even though I only annotated biggish areas for header, footer and table, the model only identified individual words. (I realise one option could be a different HF model.)

I am wondering whether, before ljvmiranda created this, there was a more “vanilla” workflow for PDFs, rather like finding paragraphs or figures as shown in https://www.youtube.com/watch?v=rwyze49ne8I, but before doing the OCR.

Once I have annotations for areas of a PDF, I am looking for a workflow which will:

A: train the model. The equivalent for NER is:

prodigy train ./myNERmodel --ner dataset_name_ner

B: get the model to predict those areas on a new PDF. The equivalent for NER is something like:

# First extract the text from the new PDF (e.g. one string per document), then:
import spacy

nlp = spacy.load("./myNERmodel")
doc = nlp(text_from_pdf)
for ent in doc.ents:
    print(ent.text, ent.label_)

I am hoping that this would be part of a pipeline whereby we have a model to identify the relevant areas of the PDF (e.g. head, foot, table); head then goes to a model that processes heads, foot goes to a different model that processes feet, and so on, as sketched below.
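Roughly, the dispatch step I have in mind looks like this; detect_regions and the process_* functions are just placeholders for whatever models I end up with, not existing APIs:

def detect_regions(page_image):
    # Placeholder: a trained region model would return (label, cropped_image) pairs here.
    return []

def process_header(crop):
    print("processing header", crop)

def process_footer(crop):
    print("processing footer", crop)

def process_table(crop):
    print("processing table", crop)

HANDLERS = {"HEADER": process_header, "FOOTER": process_footer, "TABLE": process_table}

def process_page(page_image):
    # Route each detected region to the handler for its label
    for label, crop in detect_regions(page_image):
        handler = HANDLERS.get(label)
        if handler:
            handler(crop)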

This is a great forum, and I really appreciate the answers and code snippets you provide.

Hi @alphie,

The reason why the pretrained model used in Lj's workflow did not work for your data might be that the kinds of PDFs are just very different. Another reason could be that there were not enough fine-tuning examples for each region. You might need to find a pre-trained model that is closer to your kind of data, or train a smaller model from scratch.

The most "vanilla" workflow for extracting relevant regions of a PDF would be to convert the PDF to an image and use image.manual to mark the spans. This is exactly what pdf.image.manual does. I think that is the step you refer to when you say "before doing the ocr".
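For example, the annotation command could look something like this (the dataset name, directory and labels here are placeholders; please check the prodigy-pdf plugin docs for the exact arguments):

prodigy pdf.image.manual pdf_regions ./my_pdfs --label HEADER,FOOTER,TABLE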

Then, in order to train a model to recognize the regions in new PDFs, you would need to bring your own training implementation. Prodigy doesn't ship with a built-in recipe for training a computer vision model. Providing training utilities for spaCy text pipelines is easier because we can provide sensible defaults and control the architectures. For computer vision it's a lot less clear-cut, as the training details depend very much on the kind of data, the architecture used and the framework. Here you can find one TensorFlow example.

Once you have found the right framework for you, though, it should be easy enough to convert Prodigy image annotations to the required format. You can see an example of the data format here: https://prodi.gy/docs/api-interfaces#image_manual. Finally, with your computer vision pipeline in place, you can definitely compile all components (region recognizer, OCR, spaCy NER) into one Python pipeline.
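As a rough sketch of the conversion step, assuming you've exported your annotations with prodigy db-out (e.g. prodigy db-out pdf_regions > regions.jsonl) and that each span follows the image_manual format linked above (a "label" plus polygon "points"), you could reduce them to simple labelled bounding boxes and adapt that to whatever your framework expects:

import json

def load_region_boxes(jsonl_path):
    examples = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer") != "accept":
                continue
            boxes = []
            for span in eg.get("spans", []):
                xs = [point[0] for point in span["points"]]
                ys = [point[1] for point in span["points"]]
                # Reduce the polygon to an axis-aligned box: (label, xmin, ymin, xmax, ymax)
                boxes.append((span["label"], min(xs), min(ys), max(xs), max(ys)))
            examples.append({"image": eg["image"], "boxes": boxes})
    return examples

examples = load_region_boxes("regions.jsonl")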

Thanks for the pointers. Is there a video to help with working through the TensorFlow example?

Hi @alphie,

No, I'm afraid there isn't a video tutorial for this recipe - it's been contributed by a community member.