Extracting data from PDFs using Prodigy

Hi, I'm interested in thoughts on how best to prepare PDF documents for data extraction. What is the best format to convert them into: JSON, spaCy format, or something else? Any input appreciated.

Hi Kieran!

Thanks for your question! Just curious, what is the structure of the data in the PDF files? Raw text, tables, figures, images?

To get started, have you seen this post? Using prodigy with PDF documents

I like Ines' approach because it skips the step of saving intermediary files (.txt, .spacy, .jsonl) and instead writes the loader as a generator whose output can be piped to Prodigy. I've also had success with PyMuPDF, outputting documents as .txt or .jsonl. With either approach, you may also want to do sentence segmentation beforehand to break up the documents.
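To make that concrete, here is a minimal sketch of the generator idea using PyMuPDF. It is not the code from Ines' post. The function names, the `meta` fields, and the regex-based sentence splitter are my own assumptions for illustration; for real documents a spaCy pipeline with a sentencizer would likely segment more reliably than the naive regex used here.

```python
import json
import re
import sys


def read_pdf_pages(path):
    """Yield (page_number, text) for each page of a PDF via PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the helpers below work without it

    with fitz.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            yield number, page.get_text()


def split_sentences(text):
    """Naive stand-in for sentence segmentation (an assumption, not spaCy).

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    flat = " ".join(text.split())  # collapse line breaks left by the PDF layout
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def pages_to_tasks(pages, source="example.pdf"):
    """Turn (page_number, text) pairs into task dicts with a "text" key."""
    for number, text in pages:
        for sentence in split_sentences(text):
            yield {"text": sentence, "meta": {"source": source, "page": number}}


# Only runs when called as a script with a PDF path as the single argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    pdf_path = sys.argv[1]
    for task in pages_to_tasks(read_pdf_pages(pdf_path), source=pdf_path):
        print(json.dumps(task))  # one JSON object per line (JSONL) on stdout
```

If this were saved as `pdf_to_tasks.py` (a hypothetical name), the output could be redirected to a .jsonl file, or piped straight into a recipe by passing `-` as the source so Prodigy reads from standard input, e.g. `python pdf_to_tasks.py report.pdf | prodigy ner.manual my_dataset blank:en - --label ORG`. The dataset name, file name, and label there are placeholders.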

There are other approaches that may help too, like embedding the PDF directly: How to embed a PDF file to a recipe.