Extracting data from PDFs using prodigy

kgodonoghue · May 6, 2022, 8:04pm

Hi, I'm interested in thoughts on how best to get PDF documents into a format that works well for data extraction. What is the best format to convert the PDFs into: JSON, spaCy format, or something else? Any input appreciated.

ryanwesslen · May 9, 2022, 1:37am

Hi Kieran!

Thanks for your question! Just curious, what is the structure of the data in the PDF files? Raw text, tables, figures, images?

To get started, have you seen this post: "Using prodigy with PDF documents"?

I like Ines' approach because it skips the step of saving intermediary files (.txt, .spacy, .jsonl) and instead streams the documents from a generator that can be piped to Prodigy. I've also had success with PyMuPDF, writing the documents out as .txt or .jsonl. With either approach, you may also want to do sentence segmentation beforehand to break up the documents.
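For the PyMuPDF route, here is a rough sketch of what I mean. It is not Ines' recipe, just one way to do it under some assumptions: the helper names (pdf_pages, split_sentences, to_tasks, write_jsonl) are made up for illustration, the regex splitter is a naive stand-in for a proper sentence segmenter such as spaCy's, and the output is one JSON object per line with "text" and "meta" keys.

```python
import json
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def pdf_pages(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) for each page of a PDF via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(path) as doc:
        for page in doc:
            # page.number is 0-based, so add 1 for a human-friendly page number
            yield page.number + 1, page.get_text()


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    flat = " ".join(text.split())  # collapse the line breaks left by PDF layout
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def to_tasks(pages: Iterable[Tuple[int, str]], source: str) -> Iterator[Dict]:
    """Turn pages into one task dict per sentence, lazily."""
    for page_no, text in pages:
        for sentence in split_sentences(text):
            yield {"text": sentence, "meta": {"source": source, "page": page_no}}


def write_jsonl(tasks: Iterable[Dict], out) -> None:
    """Write one JSON object per line to any file-like object."""
    for task in tasks:
        out.write(json.dumps(task) + "\n")


# Example (hypothetical file name), printing JSONL to stdout:
#   import sys
#   write_jsonl(to_tasks(pdf_pages("report.pdf"), "report.pdf"), sys.stdout)
```

Because everything is a generator, you can either redirect the output to a .jsonl file or pipe it straight into a recipe without saving anything in between. Swapping split_sentences for a real segmenter should not require changing the rest.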

There are other approaches that may help too, like embedding the PDF directly: "How to embed a PDF file to a recipe".

ryanwesslen · June 24, 2022, 4:56pm

FYI, there's an incredibly helpful new blog post by @ljvmiranda921 on extracting data from PDFs using Prodigy as well.

There's also a helpful accompanying GitHub repo.
