Thanks for your question! I actually wrote a simple PDF loader a while ago using PyPDF2. It's probably not the most elegant solution, and it only extracts the verbatim text of each page:
```python
from pathlib import Path
import PyPDF2

def get_pdf_stream(pdf_dir):
    pdf_dir = Path(pdf_dir)
    for pdf_file in pdf_dir.glob('*.pdf'):  # skip non-PDF files in the directory
        # open in binary mode and make sure the file is closed afterwards
        with pdf_file.open('rb') as file_obj:
            reader = PyPDF2.PdfFileReader(file_obj)
            text = ''
            for page in range(reader.numPages):
                page_obj = reader.getPage(page)
                text += page_obj.extractText()
        yield {'text': text}
```
Prodigy streams are generators, which is pretty nice for this type of use case. Reading in the PDFs might take some time, and by streaming in the results, you can start annotating immediately and won't have to wait for all of your data to be processed.
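Since the stream is just a generator of dictionaries, you can also plug it straight into a custom recipe. Here's a minimal sketch, assuming the loader above is defined in the same file (the recipe name `pdf.annotate` is just something I made up):

```python
import prodigy

@prodigy.recipe('pdf.annotate')
def pdf_annotate(dataset, pdf_dir):
    return {
        'dataset': dataset,                  # dataset to save annotations to
        'stream': get_pdf_stream(pdf_dir),   # generator of {'text': ...} tasks
        'view_id': 'text'                    # render tasks with the plain text interface
    }
```

You could then run it with something like `prodigy pdf.annotate your_dataset /path/to/pdfs -F recipe.py`.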
One of the nice things about the `source` argument on the command line is that it defaults to stdin. So you can pipe in data from other scripts by writing the annotation tasks to stdout...
```python
...
print(json.dumps({'text': text}))
```

... and piping them to the recipe like this:
```bash
python extract_pdfs.py | prodigy ner.teach your_dataset en_core_web_sm
```
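Putting the two pieces together, a minimal sketch of `extract_pdfs.py` could look like this, assuming the `get_pdf_stream` function from above is defined in the same file:

```python
# extract_pdfs.py
# assumes get_pdf_stream() from above is defined in this file
import json
import sys

if __name__ == '__main__':
    # write one JSON task per line so the recipe can read them from stdin
    for task in get_pdf_stream(sys.argv[1]):
        print(json.dumps(task))
```

Each line on stdout is one annotation task in JSONL format, which `ner.teach` can read directly from stdin.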
Ultimately, this depends on what your application needs to do – especially at runtime. PDF scraping is hard, so if your system needs to be able to read in PDF files or scraped text from PDFs and predict something on them, having "unclean" training data can actually be very useful. With enough examples, the model can learn to handle the noisy data, and you might achieve much better accuracy overall. Similarly, if your model needs to be able to parse tabular data instead of perfect, complete sentences, your training data should match that.
Another thing you could consider is chaining models together: start with a text classifier that helps you pre-select the examples or separate the incoming data into buckets of different text types. This is actually a pretty common workflow.
For example, let's say your data contains both paragraphs of regular text and tables / key-value pairs. The text classifier might be able to learn that distinction pretty quickly, so you could start off by training a model that predicts whether you're dealing with "regular text" or not, and use this to filter the incoming stream of examples. Maybe it turns out that one NER model can easily handle both types of input – or maybe you'll find that you get better results if you train two separate models and use the text classifier to decide which one to apply to the text later on.
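As a rough sketch, assuming you've trained a spaCy text classifier with a (hypothetical) `REGULAR_TEXT` label, the filter could be another generator wrapped around the stream:

```python
import spacy

def filter_regular_text(stream, model_path, threshold=0.5):
    # model_path, the REGULAR_TEXT label and the threshold are assumptions –
    # use whatever your classifier was actually trained to predict
    nlp = spacy.load(model_path)
    for eg in stream:
        doc = nlp(eg['text'])
        if doc.cats.get('REGULAR_TEXT', 0.0) >= threshold:
            yield eg
```

Because it's a generator, it composes nicely with the PDF loader, e.g. `stream = filter_regular_text(get_pdf_stream(pdf_dir), './textcat_model')`.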
I hope Prodigy makes it easy to try out different approaches and find out what works best on your data!