Specifying Language Parameter in pdf.ocr

Hi Prodigy Support Team,

I'm currently working with German and French PDFs in my Prodigy workflow. I'm using the pdf.image recipe followed by the pdf.ocr recipe to extract text.

Unfortunately, I'm encountering issues with the OCR output, specifically with special characters like 'ä', 'ü', 'ö', 'ç', etc. These characters are often not recognized correctly, or the OCR process fails after encountering them.

I've tested pytesseract directly in my terminal, and I found that specifying the language parameter (e.g., lang='deu') resolves the special character recognition problem.
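
For context, a direct call of that kind looks roughly like the sketch below. It assumes pytesseract, Pillow and the Tesseract 'deu' language data are installed; the helper names `ocr_image` and `count_special_chars` and any image path are illustrative placeholders, not part of Prodigy.

```python
from pathlib import Path

# Characters that tend to get mangled when the wrong language model is used.
SPECIAL_CHARS = set("äöüÄÖÜßçÇ")


def ocr_image(image_path: Path, lang: str = "deu") -> str:
    """OCR one page image with an explicit Tesseract language model."""
    # Lazy imports: the OCR stack is only needed when this is actually called.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def count_special_chars(text: str) -> int:
    """Count accented characters, as a rough check that they survived OCR."""
    return sum(1 for ch in text if ch in SPECIAL_CHARS)
```

Calling something like `ocr_image(Path("page_001.png"), lang="deu")` on a German page and comparing `count_special_chars` with and without the `lang` argument is a quick way to see the difference.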

My question is: Is there a way to pass this lang parameter to the pdf.ocr recipe within Prodigy? I'd like to be able to specify the language for each PDF to improve the accuracy of the OCR output.

Thanks in advance for your help!

Best regards,

Welcome to the forum @dh_gerard! 👋

It should be straightforward to pass the lang argument, since the plugin directly calls pytesseract.image_to_string, which accepts this parameter.
In fact, it was such a tiny change that I went ahead and released a new version (0.4.3) of the plugin that accepts a --lang argument for the pdf.ocr.correct recipe.
You can now pass any language code that pytesseract supports, i.e.:

['afr', 'amh', 'ara', 'asm', 'aze', 'aze_cyrl', 'bel', 'ben', 'bod', 'bos', 'bre', 'bul', 'cat', 'ceb', 'ces', 'chi_sim', 'chi_sim_vert', 'chi_tra', 'chi_tra_vert', 'chr', 'cos', 'cym', 'dan', 'deu', 'div', 'dzo', 'ell', 'eng', 'enm', 'epo', 'equ', 'est', 'eus', 'fao', 'fas', 'fil', 'fin', 'fra', 'frk', 'frm', 'fry', 'gla', 'gle', 'glg', 'grc', 'guj', 'hat', 'heb', 'hin', 'hrv', 'hun', 'hye', 'iku', 'ind', 'isl', 'ita', 'ita_old', 'jav', 'jpn', 'jpn_vert', 'kan', 'kat', 'kat_old', 'kaz', 'khm', 'kir', 'kmr', 'kor', 'kor_vert', 'lao', 'lat', 'lav', 'lit', 'ltz', 'mal', 'mar', 'mkd', 'mlt', 'mon', 'mri', 'msa', 'mya', 'nep', 'nld', 'nor', 'oci', 'ori', 'osd', 'pan', 'pol', 'por', 'pus', 'que', 'ron', 'rus', 'san', 'sin', 'slk', 'slv', 'snd', 'snum', 'spa', 'spa_old', 'sqi', 'srp', 'srp_latn', 'sun', 'swa', 'swe', 'syr', 'tam', 'tat', 'tel', 'tgk', 'tha', 'tir', 'ton', 'tur', 'uig', 'ukr', 'urd', 'uzb', 'uzb_cyrl', 'vie', 'yid', 'yor']

or a combination thereof, e.g. eng+fra.

So with version 0.4.3 of the plugin, you should be able to pass the language argument like so:

python -m prodigy pdf.ocr.correct target dataset:source --labels FOO --lang fra+deu
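
If you end up scripting this for several PDFs with different languages, the --lang value is just the language codes joined with +. Here is a minimal sketch, assuming hypothetical helper names (`build_lang_arg`, `ocr_correct_command`) and placeholder dataset names; it only assembles the command shown above and does not run Prodigy itself:

```python
from typing import Iterable, List, Optional, Set


def build_lang_arg(codes: Iterable[str], available: Optional[Set[str]] = None) -> str:
    """Join Tesseract language codes into a single value such as 'fra+deu'.

    `available` is an optional set of codes to validate against, e.g. the
    languages installed on your machine (an assumption: you supply this set).
    """
    ordered: List[str] = []
    for code in codes:
        code = code.strip()
        if not code:
            continue
        if available is not None and code not in available:
            raise ValueError(f"Unknown or uninstalled language code: {code!r}")
        if code not in ordered:  # drop duplicates, keep first-seen order
            ordered.append(code)
    if not ordered:
        raise ValueError("At least one language code is required")
    return "+".join(ordered)


def ocr_correct_command(target: str, source: str, labels: str, lang: str) -> List[str]:
    """Assemble the argv list for the pdf.ocr.correct call (not executed here)."""
    return [
        "python", "-m", "prodigy", "pdf.ocr.correct",
        target, f"dataset:{source}",
        "--labels", labels,
        "--lang", lang,
    ]


if __name__ == "__main__":
    lang = build_lang_arg(["fra", "deu"])
    print(" ".join(ocr_correct_command("target", "source", "FOO", lang)))
```

The resulting list can be handed to `subprocess.run` if you want to launch one session per language group.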

Hi Magdaaniol,

This is exactly what I needed! Thank you so much for the improvement/solution. I really appreciate your help!
