finding areas on pdfs for downstream training

I am working with PDFs. I know how to extract text from PDFs, train an NER or span model, and use them to extract entities or spans.
I now want to train a model to recognise areas of my PDF for downstream processing. The downstream processing will vary.
I know how to annotate a PDF. I have gone through the ljvmiranda approach.

For me, that worked reasonably well for finding words but did not identify areas at all (even though his example does show areas being found).
I have also used pdf.ocr.correct – so once I have found the area, I can OCR the text.

I am struggling with the workflow to train a model to identify different areas of a PDF, e.g. Header, Footer, Table, and then identify them on new PDFs.
Using the ljvmiranda approach, the bounding boxes were all tight around the words: even though I only annotated biggish areas for header, footer and table, the model only identified individual words. (I realise one option could be a different HF model.)

I am wondering whether, before ljvmiranda created this, there was a more “vanilla” workflow for PDFs, rather like finding paragraphs or figures as shown in https://www.youtube.com/watch?v=rwyze49ne8I, but before doing the OCR.

Once I have annotations for areas of a PDF, I am looking for a workflow which will:

A: train the model. The equivalent for NER is:

prodigy train ./myNERmodel --ner dataset_name_ner

B: get the model to predict those areas on a new PDF. The equivalent for NER is something like:

# First extract the text from the new PDF (e.g. one string per document), then:
import spacy

nlp = spacy.load("./myNERmodel")
doc = nlp(text_from_pdf)
for ent in doc.ents:
    print(ent.text, ent.label_)

I am hoping that this would be part of a pipeline whereby we have a model to identify the relevant areas of the PDF (e.g. head, foot, table); head then goes to a model that processes heads, foot goes to a different model that processes feet, and so on, as sketched below.
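Roughly, the dispatch step I have in mind looks like this; detect_regions and the process_* functions are just placeholders for whatever models I end up with, not existing APIs:

def detect_regions(page_image):
    # Placeholder: a trained region model would return (label, cropped_image) pairs here.
    return []

def process_header(crop):
    print("processing header", crop)

def process_footer(crop):
    print("processing footer", crop)

def process_table(crop):
    print("processing table", crop)

HANDLERS = {"HEADER": process_header, "FOOTER": process_footer, "TABLE": process_table}

def process_page(page_image):
    # Route each detected region to the handler for its label
    for label, crop in detect_regions(page_image):
        handler = HANDLERS.get(label)
        if handler:
            handler(crop)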

This is a great forum, and I really appreciate the answers and code snippets you provide.

Hi @alphie,

The reason why the pretrained model used in Lj's workflow did not work for your data might be that the kinds of PDFs are just very different. Another reason could be that there were not enough fine-tuning examples for each region. You might need to find a pre-trained model that is closer to your kind of data, or train a smaller model from scratch.

The most "vanilla" workflow for extracting relevant regions of a PDF would be to convert the PDF to an image and use image.manual to mark the spans. This is exactly what pdf.image.manual does. I think that is the step you refer to when you say "before doing the ocr".
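For example, the annotation command could look something like this (the dataset name, directory and labels here are placeholders; please check the prodigy-pdf plugin docs for the exact arguments):

prodigy pdf.image.manual pdf_regions ./my_pdfs --label HEADER,FOOTER,TABLE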

Then, in order to train a model to recognize the regions in new PDFs, you would need to bring your own training implementation. Prodigy doesn't ship with a built-in recipe for training a computer vision model. Providing training utilities for spaCy text pipelines is easier because we can provide sensible defaults and control the architectures. For computer vision it's a lot less clear-cut, as the training details depend very much on the kind of data, the architecture used and the framework. Here you can find one TensorFlow example.

Once you have found the right framework for you, though, it should be easy enough to convert Prodigy image annotations to the required format. You can see an example of the data format here: https://prodi.gy/docs/api-interfaces#image_manual. Finally, with your computer vision pipeline in place, you can definitely compile all components (region recognizer, OCR, spaCy NER) into one Python pipeline.
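As a rough sketch of the conversion step, assuming you've exported your annotations with prodigy db-out (e.g. prodigy db-out pdf_regions > regions.jsonl) and that each span follows the image_manual format linked above (a "label" plus polygon "points"), you could reduce them to simple labelled bounding boxes and adapt that to whatever your framework expects:

import json

def load_region_boxes(jsonl_path):
    examples = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer") != "accept":
                continue
            boxes = []
            for span in eg.get("spans", []):
                xs = [point[0] for point in span["points"]]
                ys = [point[1] for point in span["points"]]
                # Reduce the polygon to an axis-aligned box: (label, xmin, ymin, xmax, ymax)
                boxes.append((span["label"], min(xs), min(ys), max(xs), max(ys)))
            examples.append({"image": eg["image"], "boxes": boxes})
    return examples

examples = load_region_boxes("regions.jsonl")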

Thanks for the pointers. Is there a video to help with working through the TensorFlow example?

Hi @alphie,

No, I'm afraid there isn't a video tutorial for this recipe - it's been contributed by a community member.