I stumbled upon Prodigy and am very excited about using it in a scanned-document recognition application we are building at my company.
In short, we have millions of mortgage, deed, tax, foreclosure, and other real-estate-related documents that we need to extract data from. For a mortgage, for example, we may want to extract the property address, interest rate, and maturity date from the original scanned copies. For a foreclosure document, we'd want the property address, auction date, and minimum bid. The problem is that documents often do not OCR correctly because the scanned originals have blotchy ink, were scanned with older technology that didn't copy well, and so on.
So we need three things:
- a clean interface for fixing OCR mistakes: something that overlays the OCR'd text atop the original scanned text
- a way to distribute this OCR correction / data extraction task to multiple individuals while verifying their work (for example, via consensus)
- a way to re-train our OCR from the annotations / extractions produced in the first two items, so that we can scale the system to perform the same task against tens of millions of such documents.
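To make the consensus idea in the second item concrete, here is a rough sketch of what I have in mind. It is plain Python and does not use any Prodigy API; the function name `consensus`, the `min_agreement` threshold, and the sample field values are all made up for illustration.

```python
from collections import Counter
from typing import Optional


def consensus(values: list[str], min_agreement: float = 0.66) -> Optional[str]:
    """Return the answer most annotators agree on, or None if agreement is too low."""
    # Collapse whitespace so trivial spacing differences don't break agreement.
    # Empty answers are dropped and do not count toward the total.
    normalized = [" ".join(v.split()) for v in values if v and v.strip()]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count / len(normalized) >= min_agreement else None


# Hypothetical corrections from three annotators for one scanned mortgage.
corrections = {
    "property_address": ["12 Oak St", "12 Oak St", "12 Oak St."],
    "interest_rate": ["4.25%", "4.25%", "4.25%"],
    "maturity_date": ["2031-06-01", "2031-08-01", "2037-06-01"],
}

# Accept a field only when enough annotators agree; send the rest back for review.
accepted = {field: consensus(vals) for field, vals in corrections.items()}
needs_review = [field for field, value in accepted.items() if value is None]
```

Fields that reach the threshold would be treated as verified, and anything in `needs_review` would go to another annotator or a supervisor.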
Is this possible in Prodigy? If not, can we discuss working together to make it possible? Thank you in advance.