pdf.spans.manual recipe from prodigy-pdf extracts text in hexadecimal format instead of plain text

Hello Prodigy team,

I really appreciate the great work put into building such a great annotation tool.
I have been using it for PDF annotation lately. Please read my issue below.

I have been using the pdf.spans.manual recipe from prodigy-pdf for PDF span annotation for quite some time. It works for most documents, but I ran into a problem with one specific PDF format (a German company registry document).

Steps to reproduce:
Process the German company registry PDF document with the configuration below.

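import os

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

docling_model_path = os.getenv("DOCLING_MODEL_PATH")  # local docling model path
pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,  # OCR disabled: text comes from the PDF's embedded text layer
    do_table_structure=True,
    table_structure_options={"do_cell_matching": True},
    artifacts_path=docling_model_path,
)

format_options = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=StandardPdfPipeline,
        pipeline_options=pdf_pipeline_options,
        backend=PyPdfiumDocumentBackend,
    ),
    # TODO: can handle other formats such as DOCX, HTML, emails, etc.
}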

Output: the spaCy output has more than two lakh (200,000) tokens, and the token text is hexadecimal. Because of this, the browser crashes with a stack overflow error.

Expected output: plain-text tokens.

Please note: I also tried force_full_page_ocr with the Tesseract CLI options, but no luck.

Would appreciate your help.

Welcome to the forum, @basavarm! :waving_hand:

Thank you for all the kind words!

I'm afraid I'm not entirely sure what your workflow is. Judging by the configuration you provided (btw, I have formatted it as code for readability), are you working directly with docling?
Are the problematic tokens present in the docling output or do they appear in later processing (not described yet)?
Could you also share the StackOverflow issue you referred to? Thanks!
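
To help narrow down where the problematic tokens first appear, here is a rough sketch of a check that could be run on the text at each stage (for example on the string docling exports, and again on the text that reaches the recipe). The helper names `looks_hex` and `hex_token_ratio` are hypothetical, and the heuristic is an assumption: since the thread does not show what the hexadecimal tokens look like, the pattern may need adjusting to match the real output.

```python
import re

# Assumed heuristic (not taken from the thread): an optional "0x" prefix
# followed by at least four hexadecimal characters.
HEX_TOKEN = re.compile(r"(?:0[xX])?[0-9a-fA-F]{4,}")


def looks_hex(token: str) -> bool:
    """Return True if a whitespace-delimited token looks like a hex string."""
    if not HEX_TOKEN.fullmatch(token):
        return False
    if token[:2].lower() == "0x":
        return True  # an explicit 0x prefix is treated as hex outright
    # Without a prefix, require both a digit and an a-f letter, so plain
    # numbers ("2024") and ordinary words ("deed") are not counted.
    has_digit = any(c.isdigit() for c in token)
    has_letter = any(c in "abcdefABCDEF" for c in token)
    return has_digit and has_letter


def hex_token_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look hexadecimal."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(looks_hex(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    # Made-up sample strings, for illustration only.
    print(hex_token_ratio("Handelsregister B des Amtsgerichts"))
    print(hex_token_ratio("00a4 3c2f 0x1F2A text"))
```

If the ratio is already high for the text docling produces, the issue is likely in the PDF's text layer or the extraction step. If it only becomes high later, the later processing is the place to look.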