Hi Kieran!
Thanks for your question! Just curious, what is the structure of the data in the PDF files? Raw text, tables, figures, images?
To get started, have you seen this post? Using prodigy with PDF documents
I like Ines' approach because it skips the step of saving intermediary files (.txt, .spacy, .jsonl) and instead writes the loader as a generator whose output can be piped straight to Prodigy. I've also had success with PyMuPDF, writing the extracted documents out as .txt or .jsonl. With either of these, you may also want to consider doing sentence segmentation beforehand to break up the documents.
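In case it helps, here is a rough sketch of the PyMuPDF route with sentence segmentation, written as a generator that prints JSONL to stdout. This is not Ines' script from the linked post, just a minimal example under a few assumptions: the helper names (read_pdf_pages, make_spacy_splitter, to_tasks) are made up for illustration, it uses spaCy's rule-based sentencizer on a blank English pipeline, and it assumes your PDFs contain extractable text rather than scanned images.

```python
import json
import sys
from typing import Callable, Dict, Iterable, Iterator, List


def read_pdf_pages(path: str) -> Iterator[str]:
    """Yield the plain text of each page (requires PyMuPDF)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        for page in doc:
            yield page.get_text()


def make_spacy_splitter() -> Callable[[str], List[str]]:
    """Build a sentence splitter from spaCy's rule-based sentencizer."""
    import spacy

    nlp = spacy.blank("en")  # assumption: English text
    nlp.add_pipe("sentencizer")
    return lambda text: [sent.text for sent in nlp(text).sents]


def to_tasks(
    pages: Iterable[str],
    split: Callable[[str], List[str]],
    source: str,
) -> Iterator[Dict]:
    """Turn page texts into one task dict per sentence."""
    for page_number, page_text in enumerate(pages, start=1):
        # Collapse the hard line breaks that PDF extraction leaves behind.
        text = " ".join(page_text.split())
        if not text:
            continue  # skip empty pages
        for sentence in split(text):
            sentence = sentence.strip()
            if sentence:
                yield {
                    "text": sentence,
                    "meta": {"source": source, "page": page_number},
                }


if __name__ == "__main__" and len(sys.argv) > 1:
    pdf_path = sys.argv[1]
    splitter = make_spacy_splitter()
    for task in to_tasks(read_pdf_pages(pdf_path), splitter, pdf_path):
        # One JSON object per line, i.e. JSONL on stdout.
        print(json.dumps(task))
```

If you save this as, say, pdf_to_jsonl.py (a placeholder name), you could either redirect the output to a .jsonl file or pipe it into a recipe by passing - as the source, for example: python pdf_to_jsonl.py report.pdf | prodigy ner.manual my_dataset blank:en - --label PERSON,ORG. The dataset name and labels there are only placeholders.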
There are also other approaches that may help, like embedding the PDF directly: How to embed a PDF file to a recipe.