Specifying Language Parameter in pdf.ocr

dh_gerard · August 9, 2025, 5:27am

Hi Prodigy Support Team,

I'm currently working with German and French PDFs in my Prodigy workflow. I'm using the pdf.image recipe followed by the pdf.ocr recipe to extract text.

Unfortunately, I'm encountering issues with the OCR output, specifically with special characters like 'ä', 'ü', 'ö', 'ç', etc. These characters are often not recognized correctly, or the OCR process fails after encountering them.

I've tested pytesseract directly in my terminal, and I found that specifying the language parameter (e.g., lang='deu') resolves the special character recognition problem.
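A minimal sketch of that kind of direct check, assuming pytesseract and Pillow are installed along with the Tesseract language data for the codes used. The helper names `build_lang` and `ocr_image` are illustrative only and are not part of Prodigy or pytesseract:

```python
from typing import Iterable


def build_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes into the 'a+b' form, e.g. 'fra+deu'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_image(path: str, codes: Iterable[str]) -> str:
    """Run Tesseract on a single image with an explicit language setting."""
    # Imported lazily so build_lang stays usable without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        # lang is the same parameter mentioned above (e.g. lang='deu').
        return pytesseract.image_to_string(image, lang=build_lang(codes))
```

Calling `ocr_image("page.png", ["deu"])` on a German page image (the file name is a placeholder) makes it easy to compare the result against the default English model.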

My question is: Is there a way to pass this lang parameter to the pdf.ocr recipe within Prodigy? I'd like to be able to specify the language for each PDF to improve the accuracy of the OCR output.

Thanks in advance for your help!

Best regards,

magdaaniol · August 11, 2025, 9:26am

Welcome to the forum @dh_gerard!

It should be straightforward to pass the lang argument, since the plugin calls pytesseract.image_to_string directly and that function accepts this parameter.
In fact, it was such a tiny change that I went ahead and released a new version (0.4.3) of the plugin that accepts a --lang argument for the pdf.ocr.correct recipe.
You can now pass any language code that pytesseract supports, i.e.:

['afr', 'amh', 'ara', 'asm', 'aze', 'aze_cyrl', 'bel', 'ben', 'bod', 'bos', 'bre', 'bul', 'cat', 'ceb', 'ces', 'chi_sim', 'chi_sim_vert', 'chi_tra', 'chi_tra_vert', 'chr', 'cos', 'cym', 'dan', 'deu', 'div', 'dzo', 'ell', 'eng', 'enm', 'epo', 'equ', 'est', 'eus', 'fao', 'fas', 'fil', 'fin', 'fra', 'frk', 'frm', 'fry', 'gla', 'gle', 'glg', 'grc', 'guj', 'hat', 'heb', 'hin', 'hrv', 'hun', 'hye', 'iku', 'ind', 'isl', 'ita', 'ita_old', 'jav', 'jpn', 'jpn_vert', 'kan', 'kat', 'kat_old', 'kaz', 'khm', 'kir', 'kmr', 'kor', 'kor_vert', 'lao', 'lat', 'lav', 'lit', 'ltz', 'mal', 'mar', 'mkd', 'mlt', 'mon', 'mri', 'msa', 'mya', 'nep', 'nld', 'nor', 'oci', 'ori', 'osd', 'pan', 'pol', 'por', 'pus', 'que', 'ron', 'rus', 'san', 'sin', 'slk', 'slv', 'snd', 'snum', 'spa', 'spa_old', 'sqi', 'srp', 'srp_latn', 'sun', 'swa', 'swe', 'syr', 'tam', 'tat', 'tel', 'tgk', 'tha', 'tir', 'ton', 'tur', 'uig', 'ukr', 'urd', 'uzb', 'uzb_cyrl', 'vie', 'yid', 'yor']

or a combination thereof, e.g. eng+fra.

So with version 0.4.3 of the plugin, you should be able to pass the language argument like so:

python -m prodigy pdf.ocr.correct target dataset:source --labels FOO --lang fra+deu
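If your German and French pages live in separate source datasets, one option is to run the recipe once per dataset with the matching --lang value. A rough sketch, assuming the dataset names and the FOO label below are placeholders for your own; it only builds and prints the command lines and leaves running them to you:

```python
import shlex
from typing import Dict, List


def ocr_correct_command(target: str, source: str, labels: str, lang: str) -> List[str]:
    """Build the argument list for one pdf.ocr.correct run."""
    return [
        "python", "-m", "prodigy", "pdf.ocr.correct",
        target, f"dataset:{source}",
        "--labels", labels,
        "--lang", lang,
    ]


# Hypothetical mapping: source dataset name -> Tesseract language setting.
LANG_BY_DATASET: Dict[str, str] = {
    "pdf_pages_de": "deu",
    "pdf_pages_fr": "fra",
}

commands = [
    ocr_correct_command(f"{source}_ocr", source, "FOO", lang)
    for source, lang in LANG_BY_DATASET.items()
]

for cmd in commands:
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    print(shlex.join(cmd))
```

Each printed line can be pasted into a terminal, or the argument lists can be passed to subprocess.run one at a time, since each run starts its own annotation server.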

dh_gerard · August 11, 2025, 10:18am

Hi Magdaaniol,

This is exactly what I needed! Thank you so much for the improvement/solution. I really appreciate your help!
