I have come across your very interesting-looking new Prodigy tool `pdf.spans.manual`.
I am working on extracting structured data from PDFs. At present I extract text with the Python library pdfplumber and then run NLP on the text. This works up to a point.
I previously tried the PDF plugin, but at the time decided against the work involved in building a computer vision approach, and have instead been focusing on extracting data with NLP. This doesn't take the document layout into account, and since my PDFs have a strong structure, that can be a problem.
I am wondering about exploring your new tool, but I am not quite sure what it will give me.
In particular, if I annotate with `pdf.spans.manual`, do I get an output I can then use directly to train a model, without going down the computer vision route?
Hi @alphie,
Thanks for your interest in `pdf.spans.manual`! Let me explain how it can help with your PDF extraction.

`pdf.spans.manual` uses `spacy-layout` (which, in turn, uses Docling) to parse PDFs and store their components, i.e. the text, tables, and layout information, in the spaCy `Doc` data structure. This makes the result directly usable for annotation with Prodigy and then for training spaCy components, just like any other Prodigy span recipe.
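As a minimal sketch of what that parse looks like (the file path is a placeholder, and this assumes the current `spacy-layout` API):

```python
import spacy
from spacy_layout import spaCyLayout

# A blank pipeline is enough for parsing; no trained components needed yet
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Parse the PDF into a regular spaCy Doc ("./report.pdf" is a placeholder)
doc = layout("./report.pdf")

print(doc.text)      # plain extracted text, usable like any other Doc
print(doc._.layout)  # document-level layout information

# Each detected section is a Span with a label and its own layout data
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)
```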
Importantly, `spacy-layout` stores the text and the layout information separately, which means it supports both text-only use cases and layout-aware processing.

If the layout is irrelevant in your data, you can just focus on the text extracted from the PDFs for both annotation and training. If, however, you know the relevant sections up front (which I believe is your case), you can configure the recipe to focus on just these sections for annotation. See the `spacy-layout` documentation to check which section labels are available.
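For illustration, a call could look roughly like this. The dataset name, labels, and paths are placeholders, and from memory the section filter is exposed as `--focus`; please double-check the exact arguments with `prodigy pdf.spans.manual --help`:

```
prodigy pdf.spans.manual pdf_annotations blank:en ./pdfs \
  --label SECTION_A,SECTION_B --focus text,section_header
```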
In other words, `pdf.spans.manual` allows you to "abstract away" the fact that your data is in PDF format and focus just on the text if that's what you need, and/or leverage structured information about the layout where that's useful.
Similarly, in production you'll need a spaCy pipeline composed of a `spacy-layout` component and your trained model, which is applied to the output of the `spacy-layout` component.
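Sketched out, that production setup could look like this (the pipeline and document paths are placeholders):

```python
import spacy
from spacy_layout import spaCyLayout

# Load your trained pipeline ("./my_trained_pipeline" is a placeholder path)
nlp = spacy.load("./my_trained_pipeline")
layout = spaCyLayout(nlp)

# Step 1: parse the PDF into a Doc; step 2: run the trained components on it
doc = layout("./incoming_document.pdf")
doc = nlp(doc)

print(doc.ents)   # for a trained NER component
print(doc.spans)  # for a trained spancat component
```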
So yes, if `spacy-layout` produces good parses of your data, you should be able to skip the computer vision part and work with `pdf.spans.manual` directly. The recipe produces a dataset that you can use directly for training NER or spancat components with Prodigy or spaCy.
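For instance, assuming your annotations are saved in a dataset called `pdf_annotations` (a placeholder name):

```
prodigy train ./output_model --spancat pdf_annotations
# or, for an NER component:
prodigy train ./output_model --ner pdf_annotations
```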
I'd also like to recommend the latest blog post from @ines, which discusses the PDF processing capabilities of Prodigy and spaCy in detail.