Prodigy - text extraction from forms

JBB · December 1, 2020, 6:58am

Hi @Innes, is it possible to extract text from forms with Prodigy? We have a large number of printed order forms with text in the boxes. Each box is in exactly the same place on every form, but the text can vary.

I think my question is: can I fix a bounding box on a specific part of an image and give it a field label, so that Prodigy can extract the text it detects in that box?

ines · December 3, 2020, 1:47am

Hi! You can definitely label data for this type of task, but you'd have to decide which OCR solution or tool you want to use to go from image + bounding boxes to text.

For example, you can use a workflow like image.manual and draw the boxes: https://prodi.gy/docs/recipes#image-manual

If the boxes are mostly in the same place and have similar pixel coordinates, you could even pre-populate them in the data so you only have to adjust a box if its position changed. The data you export gives you the image and pixel coordinates of the box (and the label you assigned).

You can then feed that forward into an OCR tool and extract the text. The image and the x/y/width/height of the region should typically be all you need for this.
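As a rough sketch of the pre-populating idea and of turning an exported box into a crop region, something like the following could work. The field names and pixel values in FIELD_BOXES are made up for illustration, and the span layout (a "points" list plus "type", "x", "y", "width", "height") is an assumption about the image task format, so check it against a real image.manual export before relying on it. It also assumes the "image" value is a path, URL or data URI that your loader can resolve.

```python
import json

# Hypothetical form layout: field label -> (x, y, width, height) in pixels.
# These names and numbers are invented for the example.
FIELD_BOXES = {
    "CUSTOMER_NAME": (40, 120, 300, 40),
    "ORDER_NUMBER": (400, 120, 200, 40),
}


def rect_span(label, x, y, width, height):
    """Build one rectangular span (assumed span layout, see the note above)."""
    return {
        "label": label,
        "type": "rect",
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        # Corner points of the rectangle, starting at the top-left corner.
        "points": [
            [x, y],
            [x, y + height],
            [x + width, y + height],
            [x + width, y],
        ],
    }


def make_task(image_path):
    """Create one annotation task with every known field box pre-drawn."""
    spans = [rect_span(label, *box) for label, box in FIELD_BOXES.items()]
    return {"image": image_path, "spans": spans}


def to_jsonl(image_paths):
    """Serialize one task per image as newline-delimited JSON."""
    return "\n".join(json.dumps(make_task(path)) for path in image_paths)


def crop_box(span):
    """Return (left, upper, right, lower) for a span.

    Computed from the corner points, so it still holds if the annotator
    moved or resized the pre-drawn box.
    """
    xs = [point[0] for point in span["points"]]
    ys = [point[1] for point in span["points"]]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    print(to_jsonl(["form_001.jpg", "form_002.jpg"]))
```

The (left, upper, right, lower) tuple is the order that Pillow's Image.crop expects, so one option is to crop each region from the image and pass the crop to whichever OCR tool you choose (for instance pytesseract.image_to_string), keeping the span's label as the field name.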

(If you're using your own OCR model, you can also use Prodigy to improve it, e.g. by streaming in images + the extracted text and correcting the text if the model makes mistakes. You can then update it with more examples to improve it.)
