Possibility of multi-field extraction from OCR'd images

I stumbled upon Prodigy and am very excited about using it in a scanned-document recognition application we are building at my company.

In short, we have millions of mortgage, deed, tax, foreclosure, and other real estate related documents that we need to extract data from. In the case of a mortgage, for example, we may want to extract the property address, interest rate and maturity date from the original scanned copies. In the case of a foreclosure document, we'd want to extract the property address, auction date, and minimum bid. The problem we have is that documents often do not OCR correctly because the scanned originals have blotchy ink, were scanned with older technology that didn't copy well, etc.

So we need 3 things:

  1. a clean interface for fixing OCR mistakes - something that overlays the OCR'd text atop the original scanned document
  2. a way to distribute this OCR correction / data extraction task to multiple individuals that verifies their work (for example, via consensus)
  3. a way to retrain our OCR model from the annotations / extractions performed in steps 1 and 2, so that we can scale the system to perform the same task effectively across tens of millions of such documents.

Is this possible in Prodigy? If not, can we discuss working together to make it possible? Thank you in advance :slight_smile:

Hi! Do you have an example of this overlay? While it's possible in theory, it sounds like it could be slightly confusing and difficult to read. Or maybe I'm picturing it wrong?

One possible interface you could put together for this is a `blocks` UI with two blocks: an `image` and a `text_input` that's pre-populated with the OCR'd text and can be edited if needed.
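Here's a rough sketch of what that could look like as a custom recipe, assuming your input is a JSONL file where each task has an `"image"` (a URL or base64 data URI) and the raw OCR output under an `"ocr_text"` key. The recipe name and field names are just placeholders:

```python
import prodigy
from prodigy.components.loaders import JSONL


@prodigy.recipe("ocr-correct")
def ocr_correct(dataset, source):
    """Show the scanned image next to an editable text field with the OCR output."""
    blocks = [
        {"view_id": "image"},
        {
            "view_id": "text_input",
            "field_id": "ocr_text",        # corrections are saved under this key
            "field_rows": 8,
            "field_label": "Corrected OCR text",
        },
    ]

    def add_ocr_text(stream):
        for eg in stream:
            # Pre-populate the text field with the raw OCR output, so the
            # annotator only has to fix mistakes instead of retyping everything.
            eg.setdefault("ocr_text", "")
            yield eg

    return {
        "dataset": dataset,
        "stream": add_ocr_text(JSONL(source)),
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

You'd run it like any other custom recipe, e.g. `prodigy ocr-correct ocr_corrections ./ocr_tasks.jsonl -F recipe.py`.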

Another thing that could be cool and potentially very efficient: it's probably pretty easy to pre-extract the segments of the image that contain the text you care about in an automated process, or at least segment the image into regions. You could then correct the OCR'd text for each region separately. This makes it easier for your annotators, because they can take in all the information at once and don't have to constantly jump back and forth between the text and the long document. You'll probably also be able to move through the review process more quickly.
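To illustrate, here's a minimal sketch of how you could pre-segment the pages and write out one task per region, assuming you already have bounding boxes for the regions you care about (e.g. from your OCR engine's layout analysis). The `make_region_tasks` helper, the region format and the file names are all hypothetical:

```python
import base64
import json
from io import BytesIO

from PIL import Image


def image_to_data_uri(img):
    # Encode a PIL image as a base64 data URI so it can be embedded in a task.
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode("utf-8")


def make_region_tasks(image_path, regions, doc_id):
    """Yield one task per region, keeping a reference back to the parent document."""
    page = Image.open(image_path)
    for i, (left, top, right, bottom, ocr_text) in enumerate(regions):
        crop = page.crop((left, top, right, bottom))
        yield {
            "image": image_to_data_uri(crop),
            "ocr_text": ocr_text,
            "meta": {"doc_id": doc_id, "region": i},
        }


if __name__ == "__main__":
    # Hypothetical example: write the tasks out as JSONL for the recipe above.
    regions = [(100, 200, 900, 260, "lnterest rate: 4.Z5%")]  # from your OCR/layout step
    with open("ocr_tasks.jsonl", "w", encoding="utf-8") as f:
        for task in make_region_tasks("deed_0001.png", regions, doc_id="deed_0001"):
            f.write(json.dumps(task) + "\n")
```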

You could either use the built-in review workflow, or write your own recipe that streams in existing annotations from your dataset(s), groups annotations on the same data together, shows the different annotation decisions and lets you create a final answer in a `text_input` field, maybe pre-populated with the solution most annotators agreed on. If you're contrasting two different results, you could also show them as a visual diff.
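As a sketch of the second option, a custom recipe's stream could pull the existing annotations out of the database, group them by Prodigy's `_input_hash` (which identifies annotations made on the same input), and pre-fill the text field with the majority answer. The dataset names and the `ocr_text` key are assumptions carried over from the recipe above:

```python
from collections import Counter, defaultdict

from prodigy.components.db import connect


def consensus_stream(dataset_names):
    """Yield one review task per example, pre-filled with the majority answer."""
    db = connect()
    grouped = defaultdict(list)
    for name in dataset_names:
        for eg in db.get_dataset(name):
            grouped[eg["_input_hash"]].append(eg)

    for examples in grouped.values():
        texts = [eg.get("ocr_text", "") for eg in examples]
        majority, count = Counter(texts).most_common(1)[0]
        task = dict(examples[0])
        task["ocr_text"] = majority          # pre-fill with the consensus answer
        # Keep every annotator's version around so it can be shown for reference.
        task["meta"] = dict(task.get("meta", {}), versions=texts, agreement=count)
        yield task
```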

When you load in your data, you can assign your own IDs to the incoming examples, so you'll always be able to identify annotations on the same data, or documents that belong together etc.
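For example, each line of your input JSONL could carry the document and region identifiers in its `"meta"` (the exact keys are entirely up to you):

```json
{"image": "deed_0001_region_3.png", "ocr_text": "Maturity Date: 01/01/2037", "meta": {"doc_id": "deed_0001", "region": 3}}
```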

The training process itself is of course outside the scope of Prodigy, since this will depend on the tools and models you use. But you can write Prodigy workflows to run quick training experiments with the annotations, workflows for comparing different OCR models and their output (e.g. randomised A/B evaluation), and annotation workflows that always use your latest updated OCR model.
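For the A/B idea, one possible sketch: stream in each region with the outputs of two OCR models as options in the built-in choice interface, shuffled so annotators can't tell which model produced which text. The `model_a` / `model_b` keys are assumptions about how you'd store the two outputs:

```python
import random


def ab_stream(examples):
    """Turn paired OCR outputs into randomised choice tasks."""
    for eg in examples:
        # eg is assumed to look like:
        # {"image": ..., "model_a": "<text from model A>", "model_b": "<text from model B>"}
        options = [
            {"id": "model_a", "text": eg["model_a"]},
            {"id": "model_b", "text": eg["model_b"]},
        ]
        random.shuffle(options)  # hide which model produced which option
        yield {"image": eg["image"], "options": options, "meta": eg.get("meta", {})}
```

Served with `"view_id": "choice"` in a custom recipe, the annotator just picks the better transcription and you can aggregate the results per model afterwards.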

Hi Ines, thank you so much for that thorough response! I think the idea of showing the text on one side and the image on the other is a great one. To your question about an example of overlaying the OCR'd text on the scan, here is a video of a solution we are looking at. Around minute 2:10 it shows a UX similar to the workaround you described, where the annotator can confirm whether a data field was extracted correctly by pressing a toggle switch. Then around minute 2:30 it shows how annotators can tab through the different areas of the image where extraction rules picked up a data field of interest, and simultaneously edit that text if needed.

I also like the idea of breaking up the document into chunks that can be worked with more easily. The only concern there is that we would still need a way to identify each chunk's original document, which I imagine is possible in Prodigy. And if we ever wanted to annotate millions of documents, we would need to build in a way to route crowdsourced workers (say, from CrowdFlower or Amazon Mechanical Turk) into Prodigy. That makes sense, and I can look into how to do that. Out of curiosity, does the video I shared change the answer you kindly provided above, or do you think a side-by-side view is still the better approach?

Thank you also for restating Prodigy's scope... that helps me better understand that Prodigy is strictly an efficient annotation tool, the output of which is used to improve some AI model.

What is the pricing for Prodigy?