Hi Prodigy Support Team,
I'm currently working with German and French PDFs in my Prodigy workflow. I'm using the pdf.image recipe followed by the pdf.ocr recipe to extract text.
Unfortunately, I'm encountering issues with the OCR output, specifically with special characters like 'ä', 'ü', 'ö', 'ç', etc. These characters are often not recognized correctly, or the OCR process fails after encountering them.
I've tested pytesseract directly in my terminal and found that specifying the language parameter (e.g., lang='deu') resolves the special character recognition problem.
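For reference, here is a minimal sketch of the kind of direct pytesseract call I mean. The file name page.png and the helper names tesseract_lang and ocr_image are placeholders for illustration only, and it assumes the deu and fra Tesseract language data are installed.

```python
def tesseract_lang(*codes: str) -> str:
    """Join Tesseract language codes with '+', e.g. ("deu", "fra") -> "deu+fra"."""
    return "+".join(codes)


def ocr_image(path: str, lang: str) -> str:
    """Run Tesseract on a single page image with an explicit language setting."""
    # Imported here so the helper above can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        # lang is passed through to Tesseract; without it, the default is English.
        return pytesseract.image_to_string(img, lang=lang)


# Example call (page.png is a hypothetical page image exported from a PDF):
# text = ocr_image("page.png", tesseract_lang("deu", "fra"))
```

With lang set this way, the umlauts and the cedilla come out correctly for me, which is why I'd like the same setting inside the recipe.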
My question is: Is there a way to pass this lang parameter to the pdf.ocr recipe within Prodigy? I'd like to be able to specify the language for each PDF to improve the accuracy of the OCR output.
Thanks in advance for your help!
Best regards,