Prodigy - text extraction from forms

Hi! You can definitely label data for this type of task, but you'd have to decide which OCR solution or tool you want to use to go from image + bounding boxes to text.

For example, you can use a workflow like image.manual (https://prodi.gy/docs/recipes#image-manual) and draw the boxes. If the boxes are mostly in the same place and have similar pixel coordinates, you could even pre-populate them in the data, so you only have to adjust a box if its position changed. The data you export gives you the image, the pixel coordinates of each box and the label you assigned. You can then pass that on to an OCR tool and extract the text. The image and the x/y/width/height of the region should typically be all you need for this.
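
To make the last step a bit more concrete, here's a minimal sketch of turning exported annotations into crop boxes for an OCR tool. It assumes each exported line is a JSON object with an "image" value and a list of "spans" that carry "label", "x", "y", "width" and "height", and that x/y refer to the top-left corner of the box. Those key names and the helper functions `span_to_box` and `iter_regions` are assumptions for illustration, so check them against your own export first.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]


def span_to_box(span: Dict) -> Box:
    """Convert x/y/width/height into corner coordinates.

    Assumption: x and y are the top-left corner of the rectangle.
    """
    left = int(round(span["x"]))
    top = int(round(span["y"]))
    right = int(round(span["x"] + span["width"]))
    bottom = int(round(span["y"] + span["height"]))
    return (left, top, right, bottom)


def iter_regions(jsonl_lines: Iterable[str]) -> Iterator[Tuple[str, str, Box]]:
    """Yield (image, label, box) for every rectangular span in the export."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        # Skip examples that were rejected or ignored during annotation.
        if task.get("answer", "accept") != "accept":
            continue
        for span in task.get("spans", []):
            # Skip shapes that don't come with a simple rectangle.
            if not all(key in span for key in ("x", "y", "width", "height")):
                continue
            yield task["image"], span.get("label", ""), span_to_box(span)


# Hypothetical exported example with one labelled box.
example = json.dumps({
    "image": "form_001.png",
    "answer": "accept",
    "spans": [{"label": "NAME", "x": 40, "y": 120, "width": 200, "height": 30}],
})

for image, label, box in iter_regions([example]):
    print(image, label, box)
```

The 4-tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the cropped region could then be handed to whichever OCR tool you choose. Depending on how the images were loaded, the "image" value may be a path, a URL or base64-encoded data, so you may need to resolve it before opening the file.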

(If you're using your own OCR model, you can also use Prodigy to improve it, e.g. by streaming in images + the extracted text and correcting the text wherever the model makes mistakes. You can then update the model with the corrected examples.)
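
As a rough sketch of that correction loop: you could build one task per image that stores the model's output next to an editable copy, and later keep only the examples where the text was changed. The key names "ocr_text" and "transcription", the `fake_ocr` stand-in and both helper functions are hypothetical. How the editable field is shown in the UI depends on the interface you configure.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def make_correction_tasks(
    image_paths: Iterable[str], ocr_fn: Callable[[str], str]
) -> Iterator[Dict[str, str]]:
    """Create one task per image with the model's text pre-filled."""
    for path in image_paths:
        text = ocr_fn(path)
        # "ocr_text" keeps the original prediction, "transcription" is the
        # copy the annotator edits (both key names are assumptions).
        yield {"image": path, "ocr_text": text, "transcription": text}


def corrected_examples(annotated: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the examples where the annotator changed the text."""
    return [
        {"image": task["image"], "text": task["transcription"]}
        for task in annotated
        if task["transcription"] != task["ocr_text"]
    ]


def fake_ocr(path: str) -> str:
    # Stand-in for your own OCR model.
    return "Jhon Smith" if path == "form_001.png" else "42 Main Street"


tasks = list(make_correction_tasks(["form_001.png", "form_002.png"], fake_ocr))

# Simulate an annotator fixing the first transcription.
tasks[0]["transcription"] = "John Smith"

print(corrected_examples(tasks))
```

Examples where the text was left unchanged are still useful as confirmation that the model got them right, so whether you train on only the corrected ones or on everything is a judgment call.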