Pretrain Model to extract data from PDFs using .jsonl data

Right, in order to train a model from the key-value pairs in your .jsonl data, you'll need the source text. The idea would be to use your key-value pairs to bootstrap the annotation of the source.
Assuming that in production your input is PDFs (no CSVs available), for training you should also use the PDFs you generated from the CSV files.

How difficult it is to convert the PDF to text depends on how complex it is and what information you need from it. As a first experiment, you might try scraping your PDFs with a library like PyPDF2 and annotating the resulting text with the patterns created from your .jsonl files.
For that you might use ner.manual with a custom loader that converts the PDFs to text, plus patterns created from your key-value pairs. A custom loader could look something like this (adapted from this post):

from pathlib import Path
import PyPDF2

def get_pdf_stream(pdf_dir):
    pdf_dir = Path(pdf_dir)
    for pdf_file in pdf_dir.glob("*.pdf"):
        # Use a context manager so the file handle is closed after reading
        with pdf_file.open("rb") as file_obj:
            reader = PyPDF2.PdfReader(file_obj)
            # Concatenate the text of all pages into one annotation task
            text = "".join(page.extract_text() for page in reader.pages)
        # "meta" is optional, but handy for tracing a task back to its file
        yield {"text": text, "meta": {"source": pdf_file.name}}

Here you can read up on how to create patterns from your key-value pairs: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP
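
As a minimal sketch, assuming each row of your .jsonl looks something like {"invoice_number": "INV-123", "customer": "Acme Corp"} (keys and values here are made up), you could turn every value into a case-insensitive token pattern with the key as the label:

import json
from pathlib import Path

def make_patterns(jsonl_path):
    # One pattern per key-value pair: the key becomes the label and the
    # value is matched token by token, ignoring case
    for line in Path(jsonl_path).read_text().splitlines():
        record = json.loads(line)
        for key, value in record.items():
            tokens = [{"lower": token.lower()} for token in str(value).split()]
            yield {"label": key.upper(), "pattern": tokens}

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in make_patterns("key_values.jsonl"):
        f.write(json.dumps(pattern) + "\n")

One caveat: splitting on whitespace won't always line up with spaCy's tokenization (e.g. for values like "INV-123"), so you may need to adjust the patterns for values containing punctuation.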

This way you should be able to quickly reuse your key-value pairs for annotating and training a first model, and test it on your production PDFs. The production PDFs would have to be scraped in the same way, and the model then applied to the resulting text.
Eventually, you might want to build a spaCy pipeline for this. That would make it easy to experiment with other components, such as pre-trained models, which should be good at detecting person names and places. Also, the EntityRuler could be useful for entities capturable by regex-like patterns, such as dates, numbers and quantities.
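
For example, a minimal EntityRuler sketch (the labels and patterns here are just illustrative) using a token-level regex and number attributes:

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # A token-level regex for dates like 12/31/2021
    {"label": "DATE", "pattern": [{"TEXT": {"REGEX": r"^\d{1,2}/\d{1,2}/\d{4}$"}}]},
    # A number followed by a unit token, e.g. "3 kg"
    {"label": "QUANTITY", "pattern": [{"LIKE_NUM": True}, {"LOWER": {"IN": ["kg", "g", "lb"]}}]},
])

doc = nlp("Delivered 3 kg on 12/31/2021.")
print([(ent.text, ent.label_) for ent in doc.ents])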

One problem I foresee with this simple approach, though, is that it completely flattens the structure of the PDF. So depending on the kind of PDFs you work with, you might want to preserve the layout information.
For that, you'd probably need a more complex pipeline: a model to detect regions of interest, then apply OCR to these regions to convert them to text and, finally, apply an NLP model to the resulting text to extract the values for your keys.
This workflow is what the Prodigy-PDF plugin implements. I'd also recommend checking out this spaCy project for detecting the right regions on PDFs with the help of a pre-trained model.
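
As a rough sketch of the OCR step, assuming pdf2image and pytesseract as the rendering/OCR stack (both just examples) and region boxes coming from whatever detector you end up using:

from pdf2image import convert_from_path
import pytesseract

def ocr_regions(pdf_path, regions):
    # regions maps a page number to bounding boxes in pixels, e.g. as
    # predicted by a layout model: {0: [(left, top, right, bottom), ...]}
    pages = convert_from_path(pdf_path)
    for page_number, page_image in enumerate(pages):
        for box in regions.get(page_number, []):
            crop = page_image.crop(box)
            yield {
                "text": pytesseract.image_to_string(crop),
                "meta": {"page": page_number, "box": list(box)},
            }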

Either way, the NLP model would be the final component of the entire pipeline. I think what you should focus on first is finding the best way of pre-processing the PDFs for your use case. This, as mentioned before, will depend on how complex your PDFs are and to what extent the structure/layout is relevant for extracting the information.