Reviewing Approaches to NER on Unstructured Data (and Amazon Comprehend vs. spaCy + Prodigy)

Hi fellow NLP enthusiasts :smiley:!

I am really excited to set sail on my NLP journey, particularly into information extraction! I have hit some roadblocks while working on a custom NER project, and would really appreciate some help clearing them :slight_smile: To make this post more structured, I will split it into 2 areas:

  • Improving NER model performance on unstructured data (PDFs, images)
  • Amazon Comprehend-related questions (Comprehend vs. Prodigy + spaCy)

NER Model Performance Improvement

  • Currently, I am training a custom NER model that extracts entities from unstructured data (mainly PDFs), aggregates them into a data frame, and eventually outputs a CSV. I am using different types of documents, such as ride-share receipts, hotel bills, electric bills, etc.
  • I have implemented an end-to-end NER pipeline leveraging Amazon Comprehend and have trained the model three times with different dataset sizes. I started by training with only 100 PDF documents and got an F1 score of 0.72. I then increased the dataset size, trained again with 240 documents, and got an F1 score of 0.89.
  • At this point, I was initially happy with the result. I was convinced that increasing the dataset size would improve my model's performance, but I also know that I have to collect a diverse dataset - for instance, electric bills from different utility providers have different layouts and formats, so collecting bills from various providers should prevent overfitting to a single bill format.
  • I then trained the model on almost 300 documents, after collecting 50 more (30 of them are bills from 3 new providers, while the remaining 20 were obtained via data augmentation - simply adding watermarks to documents already in my training set). This time, however, the overall performance decreased, even though performance increased for some entities and decreased for others.

This concludes the description of my problem, and here are my questions for category 1:

  • Is it too early to draw any conclusions about model performance, given that the dataset size is still relatively small? Why does adding new training data decrease my model's performance?
  • I am experiencing data poverty, so I am applying data augmentation to existing documents to increase the training dataset volume; my approach is simply adding watermarks to the documents without changing any values (e.g. the energy consumption value, the customer name). Would that be an issue?
  • I am starting to think that the performance of my second trial may have been misleading - it may not be that good. The reason is that Amazon auto-split my dataset without giving me the ability to perform a random split. I am still trying to figure out how to do that with PDF data on Amazon; does anyone know how?

To avoid any confusion: I refer to “1 electric bill” as “1 document”, but Amazon defines “number of documents” as the total number of pages - so 1 electric bill that contains 5 pages counts as “5 documents”.

Amazon Comprehend vs spaCy - i.e. a model from a cloud provider vs. a model built from scratch
I discovered the Hugging Face community, Prodigy, and spaCy, and was immediately captivated by the work people do there! I am pretty convinced that Prodigy is an exceptional tool to have in my arsenal when solving NLP challenges, and I have a few questions about it:

  • Will building a model from scratch with Hugging Face transformers and spaCy give me better results than using a cloud provider?
  • For anyone who is familiar with Amazon Comprehend: are the labeled documents reusable if I decide to build my model from scratch rather than use a cloud provider? It would be quite a pain to label everything again if I build my model from scratch with Hugging Face transformers...
  • If I do have to re-label everything and train my NER model with Hugging Face transformers, what would be the best approach to annotating PDF documents? I was reading the article A framework for designing document processing solutions, and I really liked the idea of a multimodal pipeline that leverages both the image-centric and the text-centric approach. Would OCR therefore be a must, since simply converting to text would discard all the spatial/visual information?

I am really sorry for the long post, but these are some tough challenges I ran into, and I would love some constructive feedback and opinions from my fellow NLP experts! Thank you :slight_smile:

Hi @jetsonearth, welcome to Prodigy!

For Category 1, I can answer the first question:

  • You mentioned that in your latest experiment you added samples from new providers. It's possible that the F-score went down because the trained model wasn't able to generalize well to those other providers. You can sanity-check this by measuring per-provider performance.

    • Another possibility is that there's a class imbalance across your entities. I'm not familiar with Amazon Comprehend, but with spaCy you can check your NER datasets via debug data. Perhaps you can also do this manually, by checking how many examples you have for each entity label - see the sketch below.
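For that manual check, here's a minimal sketch, assuming your annotations are exported to a JSONL file where each record has a "spans" list with a "label" field (which is what Prodigy's db-out produces; the file name is a placeholder, and you'd adapt the field names to whatever export format your provider gives you):

    # Count annotated spans per entity label in a JSONL export.
    import json
    from collections import Counter

    label_counts = Counter()
    with open("annotations.jsonl", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            for span in example.get("spans", []):
                label_counts[span["label"]] += 1

    for label, count in label_counts.most_common():
        print(f"{label}: {count}")

If one entity type has far fewer examples than the others, that imbalance alone can explain a drop in the overall F-score.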

For Category 2, I can answer questions #1 and #3:

  • It's hard to say for sure which one will be more performant, as it depends on many factors. For the cloud provider you're using, can you choose which kind of model is used for your task? I think one primary advantage of the open-source tools you mentioned (Hugging Face and spaCy) is that you can fully customize them to suit your use case: change the base model, add or remove business rules, etc.

    • Personally, I leverage cloud-based tools as a baseline (since they're just one API call away), and then spend the remaining time working on more customizable pipelines with the goal of "beating" that baseline.
  • Yes, converting to text will just discard all the spatial / visual information. I think you can check out tools like pdfplumber in tandem with Prodigy's image.manual, so that you get both the text (with its coordinates) and the image annotations - see the sketch below.
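To illustrate the pdfplumber part, here's a minimal sketch that pulls words together with their coordinates out of a PDF; the file path is a placeholder, and note that pdfplumber reads the embedded text layer, so scanned PDFs would still need OCR:

    # Extract every word plus its bounding box so the spatial layout is preserved.
    import pdfplumber

    with pdfplumber.open("bill.pdf") as pdf:  # placeholder path
        for page_number, page in enumerate(pdf.pages, start=1):
            for word in page.extract_words():
                # Each word dict carries x0/x1/top/bottom coordinates in PDF points.
                print(page_number, word["text"], word["x0"], word["top"], word["x1"], word["bottom"])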

Hi @ljvmiranda921 - thank you so much for your response! I would love to use Prodigy for my annotation workflow, but it's currently out of my budget :frowning: Would I be able to get a free trial of Prodigy to try it out first?

Hi @jetsonearth ,

Glad it was helpful. For commercial trials, we typically host a VM that you can log in to. The trial lasts for two (2) weeks. This gives you the full experience of the tool, including the scriptable back-end, and also makes it easy for us to log in and help if you get stuck.

If that sounds interesting, you can email us at contact@explosion.ai.

P.S. Also, in case you are a student or an academic researcher, we still offer free academic licenses. Just mention it in the contact email above.

I appreciate the offer :slight_smile: I have gotten a free trial of Prodigy to work with. Excited to dive deeper and build an NER model from scratch!

Hi @ljvmiranda921 - I just realized you were the author of that amazing article A framework for designing document processing solutions! I gained so much insight from reading your work, thank you for sharing! I have a few clarifying questions before I get started.

  • So, right now, I have a collection of PDF files, and I have used pdf2image to convert them to PNG images. Then, would I just need to call image.manual to start annotation? It'd be something like
    prodigy image.manual ner_dataset ./images_folder ?
  • Did you pass in images directly or did you load your images as json? How should I do it?
  • For your case, did you hand-label everything, even though some entities might repeat within a document? In my case, if I am identifying a unit of measurement of electricity like kWh, it may appear in multiple places on a document. I could auto-label this if I were leveraging ner.manual with string-matching patterns; is it possible to do the same for images? Is there a way to speed up annotation?
  • Did you label everything in a document, including the information I am not interested in (from the FUNSD example, the annotators have the "other" field)? In my case, I have 6 entities that I want to recognize and extract from each document, but do I have to label everything?
  • From this image, it looks like we are simply drawing bounding boxes around certain areas of the image, and the bounding box doesn't contain textual information. How would you let the model train on both spatial and textual elements if the bounding box doesn't have textual information? Did you apply OCR after drawing the bounding boxes and add the text to the JSON file?
  • Is Prodigy capable of exporting a FUNSD-like data format that I could use to fine-tune a LayoutLMv3 model? I was reviewing the FUNSD dataset to study its format, and it looks something like this:

    There is a piece of text associated with each bounding box. How can I add the text component myself?
  • BTW, I just saw your GitHub repo for the PDF processing pipeline :slight_smile: Does that mean I could clone the project, edit the project.yml file to suit my use case, provide my own dataset with annotations, and then train using the project template right away?

Thanks!

Hi @jetsonearth , glad you found the article helpful :slight_smile:

  • So, right now, I have a collection of PDF files, and I have used pdf2image to convert them to PNG images. Then, would I just need to call image.manual to start annotation? It'd be something like
    prodigy image.manual ner_dataset ./images_folder ?

If you are using the base Prodigy, then yes, you can run the command as it is - something like the example below.
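For example (the label names here are placeholders for your own six entities - you pass them to image.manual via --label):

    prodigy image.manual ner_dataset ./images_folder --label ENERGY_CONSUMPTION,CUSTOMER_NAME,PROVIDER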

  • Did you pass in images directly or did you load your images as json? How should I do it?

For the blogpost I decided to "hydrate" the database with the images because I already had the annotations and their OCR'd text. You can find the implementation in the hydrate-db method. In your case you won't have your annotations yet, so I think using image.manual should work. However, it seems that you also won't have OCR values, and that might be tricky - there's a rough sketch of the hydration idea below in case it helps later.
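Here's a very rough sketch of what hydrating a Prodigy dataset could look like once you do have boxes and OCR'd text from a previous step. The dataset name, image path, label and the extra "text" field on each span are my own placeholders rather than anything Prodigy requires; only "image", "spans", "points" and "label" follow Prodigy's image annotation format:

    # Pre-load ("hydrate") a Prodigy dataset with image tasks that already
    # carry bounding boxes and OCR'd text from an earlier processing step.
    from prodigy import set_hashes
    from prodigy.components.db import connect

    examples = [
        {
            "image": "images_folder/bill_001.png",   # placeholder path
            "spans": [
                {
                    "label": "ENERGY_CONSUMPTION",    # placeholder label
                    "points": [[102, 340], [260, 340], [260, 372], [102, 372]],
                    "text": "1,234 kWh",              # OCR'd text attached by me; Prodigy just stores it
                }
            ],
        }
    ]
    examples = [set_hashes(eg) for eg in examples]    # add Prodigy's input/task hashes

    db = connect()                                    # uses the settings from prodigy.json
    db.add_dataset("pdf_ner_dataset")                 # placeholder dataset name
    db.add_examples(examples, datasets=["pdf_ner_dataset"])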

  • For your case, did you hand-label everything, even though some entities might repeat within a document? In my case, if I am identifying a unit of measurement of electricity like kWh, it may appear in multiple places on a document. I could auto-label this if I were leveraging ner.manual with string-matching patterns; is it possible to do the same for images? Is there a way to speed up annotation?

For the blogpost I used a standard benchmarking dataset, so I did little to no labelling. It might be good to label enough examples so that the model can discern which parts are necessary. Otherwise, I have the same answer as for your preceding question.

From this image, it looks like we are simply drawing bounding boxes around certain areas of the image, and the bounding box doesn't contain textual information. How would you let the model train on both spatial and textual elements if the bounding box doesn't have textual information? Did you apply OCR after drawing the bounding boxes and add the text to the JSON file?

I think that's where the current limitation of image.manual lies. The FUNSD dataset already provides both the bounding boxes and their textual content, so there was no need for me to perform OCR after labelling. If you're keen to use the blogpost's method, my advice is to perform the OCR first so that you already have the textual information. Integrating that with Prodigy might be a bit involved, but what I suggested should be a great start - here's a rough sketch of the OCR step.
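A minimal sketch of that OCR-first step with pytesseract could look like this; the image path is a placeholder, and you'd still need to map the word boxes onto your entity labels afterwards:

    # Run word-level OCR so every token comes with its own bounding box.
    import pytesseract
    from PIL import Image

    image = Image.open("images_folder/bill_001.png")  # placeholder path
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip empty OCR cells
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            words.append({"text": text, "box": [x, y, x + w, y + h]})

    print(words[:10])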

Is Prodigy capable of exporting a FUNSD-like data format that I could use to fine-tune a LayoutLMv3 model? I was reviewing the FUNSD dataset to study its format, and it looks something like this:

In the blogpost I used Hugging Face's format to train a LayoutLMv3 model. The output data format from Prodigy can easily be translated into that format - you can check this implementation for more info. A rough sketch of that kind of conversion is below.
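To give you an idea, here's a very rough sketch of turning word-level boxes plus labels into the inputs LayoutLMv3's processor expects (words, boxes normalized to a 0-1000 grid, integer labels). The label scheme, image path and the words variable are placeholders, assuming you already have OCR output like in the earlier sketch:

    # Convert word-level annotations into LayoutLMv3 processor inputs.
    from PIL import Image
    from transformers import AutoProcessor

    labels = ["O", "B-ENERGY_CONSUMPTION", "I-ENERGY_CONSUMPTION"]  # placeholder label scheme
    label2id = {label: i for i, label in enumerate(labels)}

    def normalize_box(box, width, height):
        # LayoutLM-family models expect coordinates scaled to a 0-1000 grid.
        x0, y0, x1, y1 = box
        return [
            int(1000 * x0 / width),
            int(1000 * y0 / height),
            int(1000 * x1 / width),
            int(1000 * y1 / height),
        ]

    image = Image.open("images_folder/bill_001.png").convert("RGB")  # placeholder path
    width, height = image.size

    # Placeholder: word dicts as produced by the OCR sketch, each with a label attached.
    words = [{"text": "1,234", "box": [102, 340, 180, 372], "label": "B-ENERGY_CONSUMPTION"}]

    processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
    encoding = processor(
        image,
        [w["text"] for w in words],
        boxes=[normalize_box(w["box"], width, height) for w in words],
        word_labels=[label2id[w["label"]] for w in words],
        return_tensors="pt",
    )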

  • BTW, I just saw your GitHub repo for the PDF processing pipeline :slight_smile: Does that mean I could clone the project, edit the project.yml file to suit my use case, provide my own dataset with annotations, and then train using the project template right away?

Yes! Feel free to clone and edit :slight_smile:
