Annotating references between bounding boxes for document understanding

My use case is creating a dataset to train models for data extraction from CV PDFs. CVs are notoriously inconsistent in their structure, and I've been using the PDF and OCR recipes to annotate things like previous experience and education. But related information, such as dates and times, might sit somewhere else on the page and would need to be linked to the bounding boxes of the other text parts. Is it possible to create a recipe and interface that allow this kind of relation annotation between bounding boxes?

Hi @HerrSebi! That's an interesting question. There's nothing out of the box that solves this, but it's probably better to tackle it in two steps, to:

  1. reduce the cognitive load of labelling and;
  2. ensure that the bounding boxes are right before doing additional classification.

Therefore, the approach could look like:

  1. Identify the bounding boxes of interest, including bounding boxes for dates and times that appear in other places on the page. The Prodigy-pdf plugin you’re already using makes a lot of sense here.

  2. Once you have the bounding boxes of interest, you could reframe the relation annotation as a binary classification task. You could customise a basic image classification recipe with a function that creates tasks showing pairs of bounding box images, which you then label as related or not. Refer to this previous support forum question on loading in image pairs here.
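To make the second step a bit more concrete, here is a minimal sketch of a function that turns the cropped boxes of one document into accept/reject tasks. It assumes each crop is a dict with the hypothetical keys "id", "label" and "image" (a path, URL or data URI), and that the task is rendered from an "html" field with the box IDs kept under "meta". The function name and the exact task layout are illustrative, so adapt them to whatever your custom recipe expects.

```python
from html import escape
from itertools import combinations


def make_pair_tasks(crops):
    """Yield one accept/reject task per unordered pair of cropped boxes.

    `crops` holds the boxes of a single document. The keys "id", "label"
    and "image" are assumptions made for this sketch.
    """
    for a, b in combinations(crops, 2):
        # Show both crops side by side; escape values used in attributes.
        html = (
            f'<img src="{escape(a["image"])}" alt="{escape(a["label"])}">'
            f'<img src="{escape(b["image"])}" alt="{escape(b["label"])}">'
        )
        yield {
            "html": html,
            # Keep both box IDs so answers can be linked back to step one.
            "meta": {"box_a": a["id"], "box_b": b["id"]},
        }
```

Accepting a task would then mean "these two boxes are related", and rejecting it would mean they are not.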

A few things you will need to consider:

  1. keeping track of IDs: you’ll need to be able to link the bounding boxes from the first step with the images in the second. You could include the bounding box IDs generated in the first step as metadata in the second.
  2. converting bounding boxes to images: if you would like to compare images of bounding boxes to see if they’re related, you’ll first need to crop the boxes out of the pages in your labelled data. You can use libraries like pdf2image to convert the PDFs to images and Pillow to crop the images based on the bounding box dimensions.
  3. choosing the image combinations carefully: you don’t need to compare bounding boxes from different documents, so only generate combinations within the same document.
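The three points above could be sketched roughly as follows. This is only an outline under some assumptions: each annotated page is a dict with hypothetical "doc_id" and "page" keys plus a "spans" list, each span has a "label" and a "points" list of [x, y] corners, and "DATE" is a placeholder for whatever label you use for dates. Check these against your actual annotations before relying on it.

```python
from collections import defaultdict
from itertools import product


def crop_box(span):
    """Convert a span's corner points into a (left, upper, right, lower) box."""
    xs = [x for x, _ in span["points"]]
    ys = [y for _, y in span["points"]]
    # Truncate to whole pixels, since crop boxes are given in pixel units.
    return (int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))


def collect_boxes(examples):
    """Flatten annotated pages into box records with stable IDs."""
    boxes = []
    for eg in examples:
        for i, span in enumerate(eg.get("spans", [])):
            boxes.append({
                # Hypothetical ID scheme: document, page and span index.
                "id": f'{eg["doc_id"]}-p{eg["page"]}-b{i}',
                "doc_id": eg["doc_id"],
                "page": eg["page"],
                "label": span["label"],
                "box": crop_box(span),
            })
    return boxes


def candidate_pairs(boxes, date_labels=("DATE",)):
    """Pair each date box only with non-date boxes from the same document."""
    by_doc = defaultdict(list)
    for box in boxes:
        by_doc[box["doc_id"]].append(box)
    for doc_boxes in by_doc.values():
        dates = [b for b in doc_boxes if b["label"] in date_labels]
        others = [b for b in doc_boxes if b["label"] not in date_labels]
        yield from product(others, dates)


def crop_image(page_image, box):
    """Crop a rendered page (a Pillow image) down to one box."""
    return page_image.crop(box)
```

The page images themselves could come from pdf2image's `convert_from_path`, which returns one Pillow image per page. The sketch assumes the box coordinates are in the same pixel space as the rendered page, so if you render at a different resolution than the one you annotated on, the boxes would need to be scaled first. Pairing only dates with non-dates is one way to keep the number of combinations down; other filters, such as same-page only, may suit your CVs better.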

Hopefully this is a good starting point for your custom two-step labelling workflow. Good luck!