Hey All!
I'm working on building a resume parser, and wanted to ask for some advice on my data annotation strategy. I have images of documents, and I'm predicting bounding boxes and doing OCR to get the text output. I want to group together the predictions in a few ways:
- Group together multi-line bullet points
- Group together all the bullet points under Roles and responsibilities
- Group together the blue box and the roles and responsibilities
This is just one example of a resume template, but there are of course many different templates. My idea is to use the position on the page, the document layout, and the text content to group blobs of text together using something like a graph neural network, and then leverage the cleaned up output for some downstream tasks like classification and NER. I'm having a hard time getting started because it's not clear the best way to annotate for this task. I'm thinking maybe the best approach is to write some custom javascript for the front-end to enable linking together the bounding boxes? Definitely open to any ideas
It's a little extra complicated because there are multiple experience in one image / document, so I can't just label the "class" of the bounding box, because I need to differentiate between experience record #1 at Align Technology and experience record #2 at Amerisourceberger.
I've also thought about other strategies like using larger bounding boxes that cover larger document sections, but then I run into issues with multiple page documents and more diverse layouts where sections are disconnected or there are other pieces in between two related text fields.
Any recommendations / advice?
Thanks!
Daniel