Extracting data from PDFs using prodigy

ryanwesslen · May 9, 2022, 1:37am

Hi Kieran!

Thanks for your question! Just curious, what is the structure of the data in the PDF files? Raw text, tables, figures, images?

To get started, have you seen this post: "Using prodigy with PDF documents"?

I like Ines' approach because it skips the step of saving intermediary files (.txt, .spacy, .jsonl) and instead yields the documents from a generator that can be piped to Prodigy. I've also had success with PyMuPDF, writing the documents out as .txt or .jsonl. With either approach, you may also want to do sentence segmentation beforehand to break up the documents.
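To make that a bit more concrete, here is a rough sketch of the generator idea using PyMuPDF. It is not the code from Ines' post. The function names (`pdf_pages`, `split_sentences`, `to_tasks`) are made up for illustration, and the regex sentence splitter is only a naive stand-in for a real segmenter such as spaCy's.

```python
import json
import re
import sys
from typing import Dict, Iterable, Iterator, List


def pdf_pages(path: str) -> Iterator[str]:
    """Yield the plain text of each page (assumes `pip install pymupdf`)."""
    import fitz  # PyMuPDF; imported lazily so the helpers below run without it

    with fitz.open(path) as doc:
        for page in doc:
            yield page.get_text()


def split_sentences(text: str) -> List[str]:
    """Naive placeholder: split after ., ! or ? followed by whitespace."""
    text = " ".join(text.split())  # collapse the line breaks left by PDF layout
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def to_tasks(pages: Iterable[str], source: str) -> Iterator[Dict]:
    """Turn page texts into one {"text": ..., "meta": ...} dict per sentence."""
    for page_no, text in enumerate(pages, start=1):
        for sent in split_sentences(text):
            yield {"text": sent, "meta": {"source": source, "page": page_no}}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one JSON object per line (JSONL) so nothing is saved to disk.
    for task in to_tasks(pdf_pages(sys.argv[1]), sys.argv[1]):
        print(json.dumps(task))
```

Saved as, say, `pdf_to_jsonl.py` (a hypothetical name), you could redirect the output to a .jsonl file, or try piping it straight into a recipe by passing `-` as the source so Prodigy reads from standard input. It's worth checking the recipe docs for the exact arguments your recipe expects.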

There are other approaches that may help too, like embedding the PDF directly: How to embed a PDF file to a recipe.
