Hello Prodigy team,
I really appreciate the work that has gone into building such a great annotation tool.
I have been using it for PDF annotation lately. Please see my issue below.
I have been using the pdf.spans.manual recipe from prodigy-pdf for PDF span annotation for quite some time. It works for most documents, but I came across one specific PDF format where it fails (a German company registry document).
Steps to reproduce:
Process the German company registry PDF with the configuration below:
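import os

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

docling_model_path = os.getenv("DOCLING_MODEL_PATH")  # local Docling model path
pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
    table_structure_options={"do_cell_matching": True},
    artifacts_path=docling_model_path,
)
format_options = {
    InputFormat.PDF: PdfFormatOption(
        pipeline_cls=StandardPdfPipeline,
        pipeline_options=pdf_pipeline_options,
        backend=PyPdfiumDocumentBackend,
    ),
    # TODO: handle other formats such as DOCX, HTML, email, etc.
}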
Actual output: the spaCy output contains more than two lakh (200,000) tokens, and the token text is hexadecimal rather than readable words. Because of this, the browser crashes with a stack overflow error.
Expected output: plain-text tokens.
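In case it helps with triage, here is a minimal sketch of how the hexadecimal tokens could be flagged before a task reaches the browser. The helper name hex_token_ratio and the min_len threshold are hypothetical and not part of Prodigy, spaCy, or Docling. It only assumes the token texts are available as a list of strings.

```python
import re

# Hypothetical helper for illustration only; not a Prodigy, spaCy, or Docling API.
# Matches strings made up entirely of hex digits, with an optional "0x" prefix.
HEX_TOKEN = re.compile(r"(?:0x)?[0-9A-Fa-f]+")


def hex_token_ratio(tokens, min_len=4):
    """Return the fraction of non-empty tokens that look like hexadecimal strings.

    Rough heuristic: a token counts as hex-like if it is at least `min_len`
    characters long, consists only of hex digits, and contains at least one
    decimal digit (so words like "face" are not counted). Plain numbers such
    as years also match, so treat the result as a hint, not a proof.
    """
    tokens = [t for t in tokens if t.strip()]
    if not tokens:
        return 0.0
    hits = sum(
        1
        for t in tokens
        if len(t) >= min_len
        and HEX_TOKEN.fullmatch(t)
        and any(c.isdigit() for c in t)
    )
    return hits / len(tokens)


if __name__ == "__main__":
    garbled = ["0041", "00e4", "1F3A", "0x00ff", "Berlin"]
    readable = ["Handelsregister", "B", "des", "Amtsgerichts", "Berlin"]
    print(hex_token_ratio(garbled))
    print(hex_token_ratio(readable))
```

A document whose ratio is close to 1.0 could be skipped or logged instead of being sent to the annotation UI, which would at least avoid the browser crash while the extraction problem is investigated.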
Please note: I also tried force_full_page_ocr with the Tesseract CLI OCR options, but no luck.
Would appreciate your help.