I have come across your very interesting-looking new Prodigy tool `pdf.spans.manual`.
I am working on extracting structured data from PDFs. At present I extract text with the Python library pdfplumber and then run NLP on the text. This works up to a point.
I previously tried the PDF plugin, but at the time decided against the work involved in building a computer vision approach, and have instead been focusing on extracting data with NLP. This doesn't take the document layout into account, and since my PDFs have a strong structure, that can be a problem.
I am wondering about exploring your new tool, but I am not quite sure what it will give me.
In particular, if I annotate with `pdf.spans.manual`, do I get an output I can then use directly to train a model, without going down the computer vision route?
Hi @alphie,
Thanks for your interest in `pdf.spans.manual`! Let me explain how it can help with your PDF extraction.

`pdf.spans.manual` uses `spacy-layout` (which, in turn, uses Docling) to parse PDFs and store their components, i.e. the text, tables, and layout information, in the spaCy `Doc` data structure. This makes the result directly usable for annotation with Prodigy and then for training spaCy components, just like any other Prodigy span recipe.
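As a minimal sketch of what that parse looks like (the file path is a placeholder, and this assumes the current `spacy-layout` API):

```python
import spacy
from spacy_layout import spaCyLayout

# A blank pipeline is enough for parsing; no trained components needed yet
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Parse the PDF into a regular spaCy Doc ("./report.pdf" is a placeholder)
doc = layout("./report.pdf")

print(doc.text)      # plain extracted text, usable like any other Doc
print(doc._.layout)  # document-level layout information

# Each detected section is a Span with a label and its own layout data
for span in doc.spans["layout"]:
    print(span.label_, span._.layout)
```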
Importantly, `spacy-layout` stores the text and the layout information separately, which means it supports both text-only use cases and layout-aware processing.

If the layout is irrelevant in your data, you can just focus on the text extracted from the PDFs for both annotation and training. If, however, you know the relevant sections up front (which I believe is your case), you can configure the recipe to focus on just these sections for annotation. See the `spacy-layout` documentation to check which section labels are available.
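For illustration, a call could look roughly like this. The dataset name, labels, and paths are placeholders, and from memory the section filter is exposed as `--focus`; please double-check the exact arguments with `prodigy pdf.spans.manual --help`:

```
prodigy pdf.spans.manual pdf_annotations blank:en ./pdfs \
  --label SECTION_A,SECTION_B --focus text,section_header
```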
In other words, `pdf.spans.manual` allows you to "abstract away" the fact that your data is in PDF format and focus just on the text if that's what you need, and/or leverage structured information about the layout where that's useful.
Similarly, in production you'll need a spaCy pipeline composed of a `spacy-layout` component and your trained model, which is applied to the output of the `spacy-layout` component.
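Sketched out, that production setup could look like this (the pipeline and document paths are placeholders):

```python
import spacy
from spacy_layout import spaCyLayout

# Load your trained pipeline ("./my_trained_pipeline" is a placeholder path)
nlp = spacy.load("./my_trained_pipeline")
layout = spaCyLayout(nlp)

# Step 1: parse the PDF into a Doc; step 2: run the trained components on it
doc = layout("./incoming_document.pdf")
doc = nlp(doc)

print(doc.ents)   # for a trained NER component
print(doc.spans)  # for a trained spancat component
```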
So yes, if `spacy-layout` produces good parses of your data, you should be able to skip the computer vision part and work with `pdf.spans.manual` directly. The recipe produces a dataset that you can use directly for training NER or spancat components with Prodigy or spaCy.
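For instance, assuming your annotations are saved in a dataset called `pdf_annotations` (a placeholder name):

```
prodigy train ./output_model --spancat pdf_annotations
# or, for an NER component:
prodigy train ./output_model --ner pdf_annotations
```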
I'd also like to recommend the latest blog post from @ines, which discusses the PDF processing capabilities of Prodigy and spaCy in detail.