pdf.spans.manual recipe from prodigy-pdf extracting text in hexadecimal format, but it should be plain text

basavarm · May 12, 2025, 5:39am

Hello Prodigy team,

I really appreciate the great work put into building such a great annotation tool.
I have been using it for PDF annotation lately. Please see my issue below.

I have been using the pdf.spans.manual recipe from prodigy-pdf for PDF span annotation for quite some time. It works for most documents, until I came across this specific PDF format (a German Company Registry document).

Steps to reproduce:
Process the German Company Registry PDF document with the configuration below:

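import os

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

docling_model_path = os.getenv("DOCLING_MODEL_PATH")  # local docling model path
pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,  # rely on the PDF's embedded text layer, no OCR
    do_table_structure=True,
    table_structure_options={"do_cell_matching": True},
    artifacts_path=docling_model_path,
)

format_options = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=StandardPdfPipeline,
        pipeline_options=pdf_pipeline_options,
        backend=PyPdfiumDocumentBackend,
    ),
    # TODO: can handle other formats such as DocX, HTML, mails, etc.
}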

Actual output: the spaCy output contains more than two lakh (200,000) tokens of hexadecimal text, and because of this the browser crashes with a stack overflow error.

Expected output: plain-text tokens.

Please note: I also tried force_full_page_ocr with the Tesseract CLI options, but no luck.

Would appreciate your help.

magdaaniol · May 19, 2025, 2:53pm

Welcome to the forum, @basavarm!

Thank you for all the kind words!

I'm afraid I'm not entirely sure what your workflow is. Judging by the configuration you provided (by the way, I have formatted it as code for readability), are you working directly with docling?
Are the problematic tokens already present in the docling output, or do they only appear in later processing (not described yet)?
Could you also share the StackOverflow issue you referred to? Thanks!
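
To help narrow down where the hexadecimal tokens first appear, here is a minimal sketch of a check that could be run on the text at each stage (for example on the text exported from docling, and again on the token texts that reach Prodigy). The names `looks_hex` and `hex_token_ratio` are hypothetical helpers, not part of docling, spaCy or prodigy-pdf, and the definition of a "hexadecimal token" (four or more hex digits with at least one numeric digit) is an assumption that may need adjusting to what your output actually looks like.

```python
import re

# Assumption: a "hexadecimal token" is an optional 0x prefix followed by
# four or more hex digits. Adjust the pattern to match your actual output.
HEX_TOKEN = re.compile(r"^(?:0x)?[0-9a-fA-F]{4,}$")


def looks_hex(token: str) -> bool:
    """Return True if the token looks like hexadecimal rather than a word."""
    # Require at least one numeric digit so that ordinary words made only of
    # the letters a-f (e.g. "decade") are not counted.
    return bool(HEX_TOKEN.match(token)) and any(ch.isdigit() for ch in token)


def hex_token_ratio(text: str) -> float:
    """Return the share of whitespace-separated tokens that look hexadecimal."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_hex(tok) for tok in tokens) / len(tokens)
```

Plain numbers such as years or registry numbers also match this pattern, so a small ratio is expected even for healthy text. A ratio close to 1.0 on the docling export would suggest the problem is already in the PDF text extraction rather than in the later processing.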
