Using prodigy with PDF documents

Hi,

This may be a silly question, but I’m trying to use prodigy (NER and Text Classification recipes specifically) on a large corpus of PDF documents with the goal of information extraction.

At the moment I’m simply scraping the text from the pages using PDFMiner. I’m getting decent results, but one of the (many) problems with these PDFs is that a lot of the information is conveyed by the page structure. For example, some parts of the PDFs will have tables or a key-value pair like “NLP Library: Spacy” instead of “Spacy is an NLP library”. I’m concerned that my rudimentary text scraping loses information that is conveyed by the structure of the document.

Do you have any suggestions on how I can use/extend prodigy to take this into account? I see in the roadmap that you plan on incorporating Excel spreadsheets… any plans for PDFs? Is that even possible, or is it a different area entirely?

Any pushes in the right direction would be helpful and apologies for any ambiguity!

Thanks,
Kevin

Thanks for your question! I actually wrote a simple PDF loader a while ago using PyPDF2. It's probably not the most elegant solution, and it also just extracts the verbatim text of each page:

from pathlib import Path
import PyPDF2

def get_pdf_stream(pdf_dir):
    pdf_dir = Path(pdf_dir)
    for pdf_file in pdf_dir.iterdir():
        with pdf_file.open('rb') as file_obj:  # make sure the file handle is closed again
            reader = PyPDF2.PdfFileReader(file_obj)
            page_count = reader.numPages
            text = ''
            for page in range(page_count):
                page_obj = reader.getPage(page)
                text += page_obj.extractText()  # concatenate the verbatim text of all pages
        yield {'text': text}  # one annotation task per PDF

Prodigy streams are generators, which is pretty nice for this type of use case. Reading in the PDFs might take some time, and by streaming in the results, you can start annotating immediately and won't have to wait for all of your data to be processed.

One of the nice things about the source argument on the command line is that it defaults to stdin. So you can pipe through data from other scripts by writing the annotation tasks to stdout...

print(json.dumps({'text': text}))

... and piping them to the recipe like this:

python extract_pdfs.py | prodigy ner.teach your_dataset en_core_web_sm
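
For completeness, a minimal extract_pdfs.py along those lines could look something like this – it just combines the loader from above with the json.dumps line, and the module name and directory path are placeholders:

# extract_pdfs.py – write one annotation task per line to stdout
import json
from pdf_loader import get_pdf_stream  # hypothetical module containing the loader defined above

for task in get_pdf_stream('/path/to/pdfs'):  # placeholder directory of PDFs
    print(json.dumps(task))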

Ultimately, this depends on what your application needs to do – especially at runtime. PDF scraping is hard, so if your system needs to be able to read in PDF files or scraped text from PDFs and predict something on them, having "unclean" training data can actually be very useful. With enough examples, the model can learn to handle the noisy data, and you might achieve much better accuracy overall. Similarly, if your model needs to be able to parse tabular data instead of perfect, complete sentences, your training data should match that.

Another thing you could consider is to chain models together and start with a text classifier to help you pre-select the examples or separate the incoming data into buckets of different text types. This is actually a pretty common workflow.

For example, let's say your data contains both paragraphs of regular text and tables / key-value pairs. The text classifier might be able to learn that distinction pretty quickly, so you could start off by training a model that predicts whether you're dealing with "regular text" or not, and use this to filter the incoming stream of examples. Maybe it turns out that one NER model can easily handle both types of input – or maybe you'll find that you get better results if you train two separate models and use the text classifier to decide which one to apply to the text later on.
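
If you do end up with two separate NER models, routing between them at runtime could look roughly like this – the model paths, the REGULAR_TEXT label and the threshold are just placeholders for whatever you end up training:

import spacy

textcat_nlp = spacy.load('/your/textcat/model')     # decides "regular text" vs. not
ner_regular = spacy.load('/your/ner/regular_text')  # NER model trained on prose
ner_tables = spacy.load('/your/ner/tables')         # NER model trained on tables / key-value pairs

def extract_entities(text):
    doc = textcat_nlp(text)
    nlp = ner_regular if doc.cats['REGULAR_TEXT'] >= 0.5 else ner_tables
    return [(ent.text, ent.label_) for ent in nlp(text).ents]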

I hope Prodigy makes it easy to try out different approaches and find out what works best on your data!


Ines, on the topic of workflows: say I want to implement the one you describe above. How does that look step by step?

Would it be something like:

  1. Annotate text categories (pre-classification)
  2. Batch train model
  3. Classify corpus based on pre-classification
  4. Create separate dataset based on pre-classification
  5. Annotate pre-classified text for target classification

?

I know that we can exclude pre-annotated text via the CLI, but is there an easy way to load in a dataset based on an annotated attribute, such as dataset['regularText']['accept'] or something of the sort?

I feel like I took the long way around in the last project.

Thx!

The easiest way would probably be to export the dataset as a JSONL file and only include answers you've accepted. You can do this by setting the --answer argument on the command line:

prodigy db-out your_dataset /output/dir --answer accept

You can then use the result as the input data for the next session. However, if you've been using active learning recipes like textcat.teach, I wouldn't necessarily recommend this workflow. The selection of examples from your stream of input text will be very biased – this is good, because it helps you collect only the most relevant examples to train the text classifier. However, those examples are not necessarily the best selection to train an NER model or a text classifier for a different task.

So a better solution would probably be to train your text classifier and then use it to filter your stream for the next step (e.g. NER). Here's a simple example of a filter function:

import spacy

nlp = spacy.load('/your/textcat/model')

def filter_stream(stream):
    for eg in stream:
        doc = nlp(eg['text'])  # process the example text with your model
        score = doc.cats['REGULAR_TEXT']  # get textcat score
        if score >= 0.75:  # some selection criterion
            yield eg

For a slightly more sophisticated solution, you could also drop the score conditional and yield (score, example) tuples instead, and then use Prodigy's prefer_high_scores sorter. This will use an exponential moving average to determine which examples to present for annotation – so you'll have a little more flexibility:

from prodigy.components.sorters import prefer_high_scores

def filter_stream(stream):
    for eg in stream:
        doc = nlp(eg['text'])  # score the example with the text classifier
        yield (doc.cats['REGULAR_TEXT'], eg)  # (score, example) tuples for the sorter

stream = prefer_high_scores(filter_stream(stream))
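
Where stream comes from depends on how you're running this – in a custom recipe, it would just be whatever loader you're using, for example Prodigy's JSONL loader (the file path here is a placeholder):

from prodigy.components.loaders import JSONL

stream = JSONL('/path/to/raw_data.jsonl')  # placeholder path to your raw input data
stream = prefer_high_scores(filter_stream(stream))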

A possible workflow could look something like this:

  1. Use textcat.teach and textcat.batch-train to train a text classifier for REGULAR_TEXT.
  2. Use this classifier to filter the incoming stream (can be the same raw data you used in the first step) for the target classification task.
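
To make step 2 a bit more concrete, here's a rough sketch that reuses the filtering idea and the stdin piping trick from earlier in this thread – the model path, the REGULAR_TEXT label, the threshold and the dataset name are all placeholders:

# filter_regular_text.py – forward only the examples the classifier considers "regular text"
import json
import sys
import spacy

nlp = spacy.load('/your/textcat/model')  # the classifier trained in step 1

for line in sys.stdin:  # annotation tasks piped in as JSON lines
    eg = json.loads(line)
    doc = nlp(eg['text'])
    if doc.cats['REGULAR_TEXT'] >= 0.75:  # same selection criterion as above
        print(json.dumps(eg))  # pass the task through to stdout

You could then chain it with the PDF loader and the NER recipe like this:

python extract_pdfs.py | python filter_regular_text.py | prodigy ner.teach your_ner_dataset en_core_web_sm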