Data Annotation Strategy for Resume Parser (Images with Grouped Bounding Boxes)?

Hey All!

I'm working on building a resume parser, and wanted to ask for some advice on my data annotation strategy. I have images of documents, and I'm predicting bounding boxes and doing OCR to get the text output. I want to group together the predictions in a few ways:

  1. Group together multi-line bullet points
  2. Group together all the bullet points under Roles and responsibilities
  3. Group together the blue box and the roles and responsibilities

This is just one example of a resume template, but there are of course many different templates. My idea is to use the position on the page, the document layout, and the text content to group blobs of text together using something like a graph neural network, and then leverage the cleaned-up output for downstream tasks like classification and NER. I'm having a hard time getting started because the best way to annotate for this task isn't clear. I'm thinking maybe the best approach is to write some custom JavaScript for the front-end to enable linking together the bounding boxes? Definitely open to any ideas :slight_smile:

It's a little extra complicated because there are multiple experience records in one image / document, so I can't just label the "class" of the bounding box: I need to differentiate between experience record #1 at Align Technology and experience record #2 at AmerisourceBergen.

I've also thought about other strategies like using larger bounding boxes that cover larger document sections, but then I run into issues with multi-page documents and more diverse layouts where sections are disconnected or there are other pieces in between two related text fields.

Any recommendations / advice?

Thanks!

Daniel

There are definitely other people working on this type of problem, and I think there are lots of tasks that come down to this sort of thing. But I haven't had much experience with it myself, so I don't have many concrete suggestions.

Have you considered doing it as a computer vision task first? You could think of it like object detection or image segmentation and use that sort of approach. You can imagine doing this sort of task with a writing system you don't know: from that perspective, it's a visual task rather than an NLP one. The boundaries are quite sharp so it's probably relatively easy on the scale of image segmentation or object detection tasks.

There are also some APIs for this. For instance, I think AWS has a product called Textract that might be helpful. I've never used it though.

Hey @honnibal, thanks a million for the reply :slight_smile:

My main question is whether Prodigy supports grouping / linking together bounding boxes during image labeling.

I definitely agree that this is at least partly a computer vision problem. I think the visual / spatial features are at least as important as the text content itself. The screenshot I shared is actually from AWS Textract, so you were spot on, and we're definitely thinking about this the same way. My approach is to first get bounding boxes and OCR text from their API, then use this representation for downstream tasks (such as grouping / linking different text elements together, classifying text elements / groups, etc.). I keep seeing research papers come out that use an approach similar to the one described above, but I'm not sure how they are annotating their data, and I was hoping that Prodigy had this functionality.

For example, in the screenshot below, two lines of text ("Textrun") are recognized separately, but are then grouped together into a new object "TextBlock". I guess you would need that grouping labeled somehow to solve this as a supervised learning problem.


(Aggarwal, Milan, Mausoom Sarkar, Hiresh Gupta, and Balaji Krishnamurthy. “Multi-Modal Association Based Grouping for Form Structure Extraction.” In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2064–73. Snowmass Village, CO, USA: IEEE, 2020. https://doi.org/10.1109/WACV45572.2020.9093376.)

In another example, the text boxes are grouped together in a graph structure, and then predictions are made on top of that. In the same way, I am assuming you would need to label which elements were grouped together (and here they have a hierarchical grouping which makes it a little more complex).

(Hwang, Wonseok, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo. “Spatial Dependency Parsing for 2D Document Understanding.” ArXiv:2005.00642 [cs], May 1, 2020. http://arxiv.org/abs/2005.00642.)

If it's helpful, here's how the dataset this paper references is annotated -- I'm just not sure how they went about creating those annotations (unless it was manually, without awesome software to make their lives easier).

This is the closest example I could come up with in terms of desired UI functionality -- using the grouping feature in PowerPoint / Google Slides. In practice I think you'd also want the ability to label the groups you create or the relationships between objects. This feels somewhat similar to the new dependency parsing annotation interface -- do you think something like this is possible in Prodigy?

Sorry for writing such a long post, I hope what I'm trying to accomplish made sense :slight_smile:

Thanks so much for everything you and your team do!

Thanks for the detailed explanation :+1: There's currently no built-in feature to do the grouping, and where it gets tricky is when the boxes you're grouping don't align as neatly as in the examples above. This would then need a whole additional layer for indicating what's part of which group, and completely separate actions for grouping, ungrouping, selecting to group and so on.

That said, if you mostly expect the boxes you want to group to line up (more or less), and you don't really expect to have much overlap, a simple solution would be this: You could have an additional label GROUP and use it to draw boxes or shapes around existing boxes that belong together.

Given the resulting data with the annotated boxes and shapes, it will be fairly trivial to calculate which boxes are inside a given group – that's just a basic geometry formula :smiley:
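Here's a rough sketch of that calculation, in case it's useful. It assumes each annotated shape is a dict with a "label" and a list of [x, y] "points", and that an "id" key identifies each text box; those key names, the `assign_groups` helper and the example coordinates are all made up for illustration, so adjust them to whatever your exported data actually looks like. It also simplifies every GROUP shape to its axis-aligned bounding rectangle and assigns a box to the first group that contains the box's center, which only works if groups don't overlap much.

```python
def bounds(span):
    """Axis-aligned bounding rectangle (x0, y0, x1, y1) of a shape's points."""
    xs = [p[0] for p in span["points"]]
    ys = [p[1] for p in span["points"]]
    return min(xs), min(ys), max(xs), max(ys)


def assign_groups(spans, group_label="GROUP"):
    """Map each GROUP shape (by index) to the boxes whose center lies inside it.

    Returns (grouped, ungrouped): a dict of group index -> list of boxes,
    and a list of boxes that fall inside no group.
    """
    groups = [s for s in spans if s["label"] == group_label]
    boxes = [s for s in spans if s["label"] != group_label]
    grouped = {i: [] for i in range(len(groups))}
    ungrouped = []
    for box in boxes:
        x0, y0, x1, y1 = bounds(box)
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for i, group in enumerate(groups):
            gx0, gy0, gx1, gy1 = bounds(group)
            if gx0 <= cx <= gx1 and gy0 <= cy <= gy1:
                grouped[i].append(box)
                break  # first matching group wins
        else:
            ungrouped.append(box)
    return grouped, ungrouped


# Hypothetical example: two GROUP shapes and four text boxes.
spans = [
    {"label": "GROUP", "points": [[0, 0], [100, 0], [100, 50], [0, 50]]},
    {"label": "GROUP", "points": [[0, 60], [100, 60], [100, 120], [0, 120]]},
    {"label": "TEXT", "id": "a", "points": [[10, 5], [90, 5], [90, 20], [10, 20]]},
    {"label": "TEXT", "id": "b", "points": [[10, 25], [90, 25], [90, 45], [10, 45]]},
    {"label": "TEXT", "id": "c", "points": [[10, 70], [90, 70], [90, 90], [10, 90]]},
    {"label": "TEXT", "id": "d", "points": [[200, 200], [250, 200], [250, 220], [200, 220]]},
]

grouped, ungrouped = assign_groups(spans)
for index, members in grouped.items():
    print(index, [box["id"] for box in members])
print("ungrouped:", [box["id"] for box in ungrouped])
```

If your groups are drawn as free-form polygons rather than rectangles, or if they overlap, you'd want a proper point-in-polygon test or an overlap ratio instead of the center check.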