Hi Prodigy Support Team,
I'm currently working with German and French PDFs in my Prodigy workflow. I'm using the pdf.image recipe followed by the pdf.ocr recipe to extract text.
Unfortunately, the OCR output has problems with special characters such as 'ä', 'ü', 'ö', and 'ç': they are often misrecognized, or the OCR process fails when it encounters them.
I've tested pytesseract directly from my terminal, and specifying the language parameter (e.g., lang='deu') resolves the special character recognition problem.
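
A minimal sketch of that direct pytesseract call, assuming Pillow is installed and the Tesseract language packs `deu` and `fra` are available. The helper names and the tag-to-code mapping are illustrative only and are not part of Prodigy or pytesseract:

```python
# Illustrative mapping from short language tags to Tesseract language codes.
TESSERACT_LANGS = {"de": "deu", "fr": "fra"}


def tesseract_lang(*tags):
    """Join one or more tags into Tesseract's 'deu+fra' style language string."""
    return "+".join(TESSERACT_LANGS[tag] for tag in tags)


def ocr_image(image_path, lang):
    """Run Tesseract OCR on a single page image with an explicit language."""
    # Imported lazily so tesseract_lang() works without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

For example, `ocr_image("page_1.png", tesseract_lang("de"))` would OCR a hypothetical page image as German, and `tesseract_lang("de", "fr")` yields a combined language string for documents that mix both languages.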
My question is: Is there a way to pass this lang parameter to the pdf.ocr recipe within Prodigy? I'd like to be able to specify the language for each PDF to improve the accuracy of the OCR output.
Thanks in advance for your help!
Best regards,