Prodigy - text extraction from forms

ines · December 3, 2020, 1:47am

Hi! You can definitely label data for this type of task, but you'd have to decide which OCR solution or tool you want to use to go from image + bounding boxes to text.

For example, you can use a workflow like image.manual (https://prodi.gy/docs/recipes#image-manual) and draw the boxes. If the boxes are mostly in the same place and have similar pixel coordinates, you could even pre-populate them in the data, so you only have to adjust a box if its position changed. The data you export gives you the image, the pixel coordinates of each box and the label you assigned. You can then feed that forward into an OCR tool and extract the text – the image and the x/y/width/height of the region should typically be all you need for this.
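As a rough sketch of both ends of that workflow, something like the following could work. It assumes each task is a dict with an "image" and a list of "spans", where a span has a "label", corner "points" and (for rectangles) "x", "y", "width" and "height"; double-check that against your own exported data. The helper names (prefilled_task, span_to_box, extract_fields, tesseract_ocr) are made up for this example, the "image" value is assumed to be a file path rather than a base64 string, and Pillow + pytesseract are just one possible OCR choice.

```python
from typing import Callable

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def prefilled_task(image: str, template: dict[str, Box]) -> dict:
    """Build one input task with a rectangle already drawn for each field."""
    spans = []
    for label, (left, top, right, bottom) in template.items():
        corners = [[left, top], [left, bottom], [right, bottom], [right, top]]
        # "type": "rect" mirrors what exported rectangles look like (assumption).
        spans.append({"label": label, "points": corners, "type": "rect"})
    return {"image": image, "spans": spans}


def span_to_box(span: dict) -> Box:
    """Turn one annotated span into a crop box."""
    if all(key in span for key in ("x", "y", "width", "height")):
        left, top = span["x"], span["y"]
        right, bottom = left + span["width"], top + span["height"]
    else:
        # No x/y/width/height: use the bounding rectangle of the points.
        xs = [point[0] for point in span["points"]]
        ys = [point[1] for point in span["points"]]
        left, top, right, bottom = min(xs), min(ys), max(xs), max(ys)
    return (round(left), round(top), round(right), round(bottom))


def extract_fields(task: dict, ocr: Callable[[str, Box], str]) -> list[tuple[str, str]]:
    """Run the given OCR function on every box and return (label, text) pairs."""
    return [
        (span["label"], ocr(task["image"], span_to_box(span)))
        for span in task.get("spans", [])
    ]


def tesseract_ocr(image_path: str, box: Box) -> str:
    """One possible OCR backend; needs Pillow and pytesseract installed."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img.crop(box)).strip()
```

You'd call prefilled_task once per form image to create the input data, annotate, and then run extract_fields(task, tesseract_ocr) over the exported tasks. Passing the OCR function in as an argument means you can swap in a different tool without touching the box logic.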

(If you're using your own OCR model, you can also use Prodigy to improve it, e.g. by streaming in images + the extracted text and correcting the text if the model makes mistakes. You can then update it with more examples to improve it.)
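A minimal sketch of how the input for that correction workflow could be prepared, assuming your model is available as a function that takes an image path and returns a string. The "image" and "text" keys follow the usual task format; the function names are hypothetical, and how the text is made editable in the UI (e.g. via a custom recipe) is not shown here.

```python
import json
from typing import Callable, Iterable, Iterator


def correction_tasks(image_paths: Iterable[str], ocr: Callable[[str], str]) -> Iterator[dict]:
    """Pair each image with the model's current prediction."""
    for path in image_paths:
        yield {"image": path, "text": ocr(path)}


def to_jsonl(tasks: Iterable[dict]) -> str:
    """Serialize tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)
```

The corrected image + text pairs you get back are then the examples you'd use to update the model.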
