Pretrain Model to extract data from PDFs using .jsonl data

I am working on a project to extract data from pdfs. I have played about a bit with annotating the pdfs with various labels using Prodigy, and it's going quite slowly so far.

However, I have lots and lots of the sort of actual data that I want which (I think) can be exported as .jsonl. I am wondering if there is a way to train (or perhaps pretrain) the model using the .jsonl data and then (somehow - this is what I am unsure of) apply it to my pdfs.

Very much open to ideas.

The data I have is in key, value pairs. So if it was data for an invoice it would look something like this
{"NAME":"Joe Smith","ADD_1":"75 High St","ADD_2":"Warwick","PHONE":"01345341687","ITEM_1":"Cardboard","ITEM_2":"Plastic","AMNT_1":70.00,"AMNT_2":80.00,"AMNT_TOT":150.00,"TERMS":"30 DAYS"}

and I want to extract similar data from pdfs of invoices. The Keys will always be the same. I want to create a model that can extract the values from pdfs and assign them to the correct Keys

Welcome to the forum @alphie!

If you have relevant data that can help speed up the annotation, it is definitely worth experimenting with. First, let me make sure I understand correctly what resources are available.
The extracted data you have in jsonl format looks like:

{"NAME":"Joe Smith","ADD_1":"75 High St","ADD_2":"Warwick","PHONE":"01345341687","ITEM_1":"Cardboard","ITEM_2":"Plastic","AMNT_1":70.00,"AMNT_2":80.00,"AMNT_TOT":150.00,"TERMS":"30 DAYS"}

and it comes together with the source it was extracted from. Is the source in .txt or PDF format?

Also, how are you currently annotating PDFs? Are you annotating them as images (e.g. using the Prodigy PDF plugin), or are you converting them to text and annotating text spans?

The data I have in jsonl format was derived from the csv files that are used to create the pdfs in the first place. These csv files are available as open data, so there is a lot of data with the labels and values. Hence my idea of training a model to recognise the kind of values that each key could have. In the invoice example, PHONE looks like a string of numbers and ADD_1 may be a mix of numbers and characters. I am just using the idea of invoices to explain the problem. The real problem would not be amenable to such simple rules, hence the need for a model.

However, there are lots of different software packages for creating pdfs from the csv files. And even when the same software was used, different organisations used different layouts, e.g. address at the top right or top left. I have access to a few different programs for creating the pdfs, so we could pair up the jsonl and equivalent pdfs. It would not be a complete solution, because there are other layouts out there. I have examples of these other layouts but no data to go with them. But I am wondering whether a partly trained model that recognises the kind of values which go with each key would get things going, so that we could then extend the model on the different pdfs.

At the moment I am annotating the pdfs using the Prodigy PDF plugin. So far I haven't managed to make a model from the annotations, and I am exploring how to approach the problem.

I am not quite sure how to convert pdfs to text and then annotate them as text spans (I am new to prodigy). Outside of prodigy I have tried extracting the pdfs as text, e.g. with pdfminer, and the text stream is very variable. Is that what you mean, or is there another approach to converting to text and annotating as text spans that I could try?

Right, in order to train a model from the key-value pairs in your jsonl data you'll need the source text. The idea would be to use your key-value pairs to bootstrap the annotation of the source.
Assuming that in production your input is PDFs (no CSVs available), for training you should also use the PDFs you generated from the CSV files.
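To make the bootstrapping idea concrete, here is a minimal sketch of pairing one jsonl record with the text of its PDF and locating each value by exact string search. The function names find_spans and make_task are hypothetical, and the sketch assumes the values appear verbatim in the scraped text, which will often not hold for numbers or reformatted addresses. The "text" plus "spans" (start, end, label) layout follows Prodigy's task format for NER; overlapping matches are not resolved here and would need handling before annotation.

```python
import json

def find_spans(text, record):
    """Return character-offset spans for every record value found verbatim in text."""
    spans = []
    for label, value in record.items():
        needle = str(value)
        if not needle:
            continue
        # Collect every occurrence of the value, not just the first one
        start = text.find(needle)
        while start != -1:
            spans.append({"start": start, "end": start + len(needle), "label": label})
            start = text.find(needle, start + len(needle))
    # Sort by position so the spans follow reading order
    return sorted(spans, key=lambda s: s["start"])

def make_task(text, record):
    """Build a task dict with the text and its pre-computed spans."""
    return {"text": text, "spans": find_spans(text, record)}

# Toy example: in practice `text` would come from the scraped PDF
text = "Invoice for Joe Smith, 75 High St, Warwick. Terms: 30 DAYS"
record = {"NAME": "Joe Smith", "ADD_1": "75 High St", "ADD_2": "Warwick", "TERMS": "30 DAYS"}
task = make_task(text, record)
print(json.dumps(task))
```

Values that are not found simply produce no span, so the number of spans per task is a rough indicator of how well a given PDF layout survives the text extraction.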

How difficult it is to convert the PDF to text depends on how complex it is and what information you need from it. As a first experiment, you might try scraping your PDFs with a library like PyPDF2 and annotating the resulting text with the patterns created from your jsonl files.
For that you might use ner.manual with a custom loader that converts the PDF to text as well as patterns created from your key-value pairs. A custom loader could look something like this (adapted from this post):

from pathlib import Path
import PyPDF2

def get_pdf_stream(pdf_dir):
    """Yield one task dict ({"text": ...}) per file in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    for pdf_file in pdf_dir.iterdir():
        # Open the PDF in binary mode; the file is closed when the block exits
        with pdf_file.open('rb') as file_obj:
            reader = PyPDF2.PdfReader(file_obj)
            # Concatenate the extracted text of all pages into one string
            text = ''
            for page_obj in reader.pages:
                text += page_obj.extract_text()
        yield {"text": text}

Here you can read about how to create patterns from your key-value pairs: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP
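As a rough illustration of that step, the sketch below turns key-value records into pattern entries, one JSON object per line with a "label" and an exact-string "pattern". The helper names record_to_patterns and patterns_to_jsonl are hypothetical. Formatting floats with two decimals is an assumption about how amounts are printed in the PDFs, so adjust it to whatever your documents actually show.

```python
import json

def record_to_patterns(record):
    """Convert one key-value record into a list of pattern dicts."""
    patterns = []
    for label, value in record.items():
        if isinstance(value, float):
            # Assumption: amounts appear with two decimals in the PDF text
            text = f"{value:.2f}"
        else:
            text = str(value)
        patterns.append({"label": label, "pattern": text})
    return patterns

def patterns_to_jsonl(records):
    """Serialise the patterns of all records as JSONL, skipping duplicates."""
    seen = set()
    lines = []
    for record in records:
        for pattern in record_to_patterns(record):
            key = (pattern["label"], pattern["pattern"])
            if key not in seen:
                seen.add(key)
                lines.append(json.dumps(pattern))
    return "\n".join(lines)

# Toy example with a record shaped like the invoice data in this thread
example = {"NAME": "Joe Smith", "PHONE": "01345341687", "AMNT_1": 70.0, "TERMS": "30 DAYS"}
print(patterns_to_jsonl([example]))
```

The resulting string can be saved to a file and passed to the recipe as the patterns file. With a lot of open data this list can get long, so it may be worth starting with a sample of records.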

This way you should be able to quickly reuse your key-value pairs for annotation, train a first model and test it on your production PDFs. The production PDFs would also have to be scraped in the same way, and the model should then be applied to the resulting text.
Eventually, you might want to build a spaCy pipeline for this. This would make it easy to experiment with other components, such as pre-trained models that should be good at detecting people's names and places. The Entity Ruler could also be useful for entities that can be captured by regex patterns, such as dates, numbers, quantities and similar.

One problem I foresee with this simple approach, though, is that simple scraping completely flattens the structure of the PDF. So depending on the kind of PDFs you work with, you might want to preserve the layout information.
For that, you'd probably need a more complex pipeline: a model to detect regions of interest, then OCR applied to these regions to convert them to text and, finally, an NLP model applied to the resulting text to extract the values for your keys.
This workflow is what the Prodigy PDF plugin implements. I also recommend you check out this spaCy project for detecting the right regions on PDFs with the help of a pre-trained model.

Either way, the NLP model would be the final component of the entire pipeline. I think what you should focus on first is finding the best way to pre-process PDFs for your use case. This, as mentioned before, would depend on how complex your PDFs are and to what extent the structure/layout is relevant for the extraction of the information.

Thank you for the full answer.

I understand the first option (ignoring pdf structure) comprises

  1. Read in pdf as text stream (eg using get_pdf_stream).
  2. Create patterns from key value pairs using match patterns
  3. Reuse key value pairs for annotation and training the first model

Can you tell me more about how I would “reuse key value pairs for annotation”? What recipe would I use?
(I am also looking at the pdf plugin and suggested spacy project, but will leave questions on that for another day)

Hi @alphie ,

That's right. These would be the first steps. Hopefully simple scraping results in valuable text.
By “reuse key value pairs for annotation” I mean using the patterns created in step 2 to bootstrap NER annotation by loading these patterns into the ner.manual recipe. Here you can find a detailed explanation of the workflow: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP

Remember that you'd need to provide your custom loader to the built-in ner.manual recipe, or write a custom ner.manual recipe if you prefer. With just a custom loader function you should be fine, though.
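One way to plug in a custom loader without writing a recipe is to have a small script print one JSON task per line and pipe that into the recipe. The sketch below shows only the serialisation part. The function name tasks_to_lines, the script name load_pdfs.py, the dataset name and the file names in the comment are all hypothetical, and the command line is an assumption based on Prodigy reading from standard input when the source is "-", so please check it against the docs for your version.

```python
import json

def tasks_to_lines(tasks):
    """Yield each task dict as a single JSON line, ready to be printed."""
    for task in tasks:
        yield json.dumps(task)

if __name__ == "__main__":
    # In a real script, replace this list with get_pdf_stream("./pdfs")
    demo_tasks = [{"text": "Invoice for Joe Smith"}, {"text": "Terms: 30 DAYS"}]
    for line in tasks_to_lines(demo_tasks):
        print(line)

# Assumed usage (hypothetical names, verify against your Prodigy version):
#   python load_pdfs.py | prodigy ner.manual invoices_ner blank:en - \
#       --loader jsonl --label NAME,ADD_1,TERMS --patterns patterns.jsonl
```

Keeping the loader as a separate script also makes it easy to inspect the scraped text on its own before any annotation starts.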