Hi, I'm interested in thoughts on how best to get PDF documents into a format that works well for data extraction. What is the best format to convert the PDF documents into: JSON, spaCy format, or something else? Any input appreciated.
Thanks for your question! Just curious, what is the structure of the data in the PDF files? Raw text, tables, figures, images?
To get started, have you seen this post? Using prodigy with PDF documents
I like Ines' approach because it skips the step of saving intermediary files (.jsonl) and instead yields the examples from a generator that can be piped to Prodigy. I've also had success using PyMuPDF and writing the documents out as .jsonl. With either approach, you may also want to do sentence segmentation beforehand to break up the documents.
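As a rough illustration of the PyMuPDF route, here is a minimal sketch, not the approach from the linked post. It assumes PyMuPDF is installed (imported as fitz) and that the PDF contains extractable text rather than scanned images. The function names (pdf_pages, split_sentences, to_examples, main) and the "source"/"page" meta keys are made up for this example, and the regex sentence splitter is only a naive stand-in for a proper segmenter such as spaCy's.

```python
from __future__ import annotations

import json
import re
import sys
from typing import Iterable, Iterator


def pdf_pages(path: str) -> Iterator[tuple[int, str]]:
    """Yield (page_number, text) for each page of the PDF."""
    import fitz  # PyMuPDF; imported here so the helpers below work without it

    with fitz.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            yield number, page.get_text()


# Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
# This will mis-split abbreviations; swap in a real segmenter for serious use.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Collapse line breaks and split the text into rough sentences."""
    flat = " ".join(text.split())
    return [s for s in _SENTENCE_END.split(flat) if s]


def to_examples(pages: Iterable[tuple[int, str]], source: str) -> Iterator[dict]:
    """Turn page texts into one {"text": ..., "meta": ...} dict per sentence."""
    for number, text in pages:
        for sentence in split_sentences(text):
            yield {"text": sentence, "meta": {"source": source, "page": number}}


def main(path: str) -> None:
    """Write one JSON object per line to stdout, without an intermediate file."""
    for example in to_examples(pdf_pages(path), source=path):
        sys.stdout.write(json.dumps(example) + "\n")
```

Calling main(path) from a small script prints JSONL to standard output, so the result can either be redirected into a .jsonl file or piped onward without saving anything in between.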
There are other approaches too, like embedding the PDF directly, that may help: How to embed a PDF file to a recipe.