Taking a Computer Vision Approach (leveraging image.manual) to build a custom NER model on PDFs

Hi! I am very new to Prodi.gy and am still learning how to use the tool.

I would like to build a custom NER pipeline on PDF electric bill documents, and I saw Ines's comment, on one of the posts asking a similar question, about framing this task as a CV problem:

An alternative approach that I've been seeing more often is framing the whole problem differently and as a computer vision task. This seems to be especially effective if the visual structure of the documents holds a lot of important clues, like in an invoice. So the model would then predict where the recipient or total amount is, and in the next step, you'd use OCR to convert the contents of the bounding box to text. This approach is more involved, though, and potentially overkill for this specific use case.

Essentially, you would leverage image.manual and draw bounding boxes on the features you would like to extract, then perform OCR to recognize the text. I am quite fascinated by this approach, given that I am working on documents like electric bills, where the visual structure/layout conveys a lot of information for the NER task. I have a few questions to clarify before moving forward:

  • Will I have to convert the PDFs into images locally before performing annotations? Currently, Prodi.gy doesn't accept PDFs right?
  • What is the output going to be like from this CV-powered NER pipeline? What does it return? I am just trying to understand how I could perform OCR based on the results given.
  • How can I automate the labeling process for the task? Some features might occur multiple times within a document. How could I auto-tag all of them?

Thank you! Explosion is awesome :wink:

Hi Jetson, here are the answers to your questions:

  1. Yes, correct. Prodigy doesn't natively support PDFs. However, you can write your own custom recipe that reads them, for example with a Python package that can parse .pdf files. That may be worth considering if your PDFs follow a very strict structure, but the image OCR path seems like the more common approach.
  2. An annotated image will have bounding boxes with the data format described here.
  3. You may appreciate this answer if you want to auto-tag images.
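
For the first point, here is a minimal sketch of the PDF-to-image step. It assumes the third-party pdf2image package (which in turn needs Poppler installed); the function names, the file naming scheme, and the DPI value are illustrative choices and not part of Prodigy.

```python
from pathlib import Path
from typing import List


def page_image_name(pdf_path: str, page_number: int) -> str:
    """Build a file name such as 'bill_page_001.png' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.png"


def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 200) -> List[str]:
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Imported lazily so the naming helper works without pdf2image installed.
    from pdf2image import convert_from_path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out / page_image_name(pdf_path, number)
        page.save(target, "PNG")  # each page is a PIL image
        paths.append(str(target))
    return paths
```

The resulting folder can then be used as the source for image.manual, for example `prodigy image.manual bills_dataset ./bill_images --label TOTAL,ACCOUNT_NO` (the dataset name, folder, and labels here are made up).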

Hi Vincent - thank you for getting back to me :slight_smile:

From what I have seen so far, users often apply CV for image classification, but not for OCR text detection. My understanding of the workflow is this:

  • Convert PDFs into images first.
  • Annotate the images with image.manual: draw bounding boxes over the text you would like to extract and assign entity labels to them.
  • Once you have the annotated dataset, feed it into spaCy for training.
  • Use the trained model on unseen PDFs and images.

Does this look good?

For clarification, image.manual and image.correct are only used to annotate data right? They don't involve training the models. I am still a little confused by these two steps. Do you use only image.manual for annotating all of your data, or do you use both image.manual and image.correct for data annotation?

Another thing is that, from the link you sent me, I can see that annotated images are stored in a format like this:

So, if I use a trained model on unseen PDF documents and images, it should also predict bounding boxes around information and store the predictions in the same format, right? How could I integrate OCR to extract text from that format?

Additional information:
I just discovered the LayoutLMv3 model on Hugging Face for NER on PDF documents. How could I convert the annotated images into a workable format for these models? It seems like conversion of data formats has been a challenging task since I started working on NER.

There are many ways to accomplish this, but given a model that can predict the bounding boxes ... the simplest method might be a custom Python script that converts the image contents to text using something like tesseract. I wrote a small tutorial for that on calmcode if you're interested.
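
As a rough sketch of such a script: the code below assumes each annotated or predicted image is a dict with a "spans" list, where every span has a "label" and a list of [x, y] "points" in pixels. That layout is an assumption about the bounding box format, so adjust the keys to whatever your data actually contains. The helper names are made up, and the Tesseract part assumes pillow and pytesseract are installed.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def span_to_box(span: dict) -> Box:
    """Turn a span's polygon points into an axis-aligned crop box."""
    xs = [x for x, _ in span["points"]]
    ys = [y for _, y in span["points"]]
    return (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))


def extract_fields(task: dict, read_box: Callable[[Box], str]) -> List[Dict[str, str]]:
    """Run `read_box` (cropping + OCR) on each labelled box of one image."""
    return [
        {"label": span["label"], "text": read_box(span_to_box(span)).strip()}
        for span in task.get("spans", [])
    ]


def tesseract_reader(image_path: str) -> Callable[[Box], str]:
    """Build a `read_box` function backed by Tesseract."""
    # Imported lazily so the box logic can be used without OCR installed.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    return lambda box: pytesseract.image_to_string(image.crop(box))
```

With an image on disk, this could be called as `extract_fields(task, tesseract_reader("bill_page_001.png"))`, which returns one label/text pair per bounding box. Passing the OCR step in as a function keeps the box handling separate, so you can swap Tesseract for another OCR engine later.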

You are correct in saying that image.manual and image.correct merely annotate data. They do not train a model. I'm usually more of a fan of the .manual recipes, and I like to prepare a script upfront that helps me select interesting candidates for annotation.

I have never worked with the LayoutLMv3 model myself, but you may appreciate this blog post for more info on that.
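
One piece of the conversion that can be sketched without knowing the rest of the pipeline: LayoutLM-style models expect bounding boxes as [x0, y0, x1, y1] scaled to a 0-1000 range relative to the page size. The snippet below assumes each annotated image is a dict with "width", "height", and a "spans" list of labelled [x, y] "points"; those keys and the function names are assumptions for illustration.

```python
from typing import List


def normalize_box(box: List[float], width: float, height: float) -> List[int]:
    """Scale a pixel box [x0, y0, x1, y1] to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


def task_to_layout_example(task: dict) -> dict:
    """Collect normalized boxes and labels from one annotated image."""
    boxes, labels = [], []
    for span in task.get("spans", []):
        xs = [x for x, _ in span["points"]]
        ys = [y for _, y in span["points"]]
        pixel_box = [min(xs), min(ys), max(xs), max(ys)]
        boxes.append(normalize_box(pixel_box, task["width"], task["height"]))
        labels.append(span["label"])
    return {"boxes": boxes, "labels": labels}
```

Note that these are region-level boxes. Token classification with LayoutLMv3 usually works on one box per word, so the OCR words inside each region would still need to be paired with boxes and labels before training.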