I would like to extract pages with maps from a pdf. The document contains maps of airports and pages of text. Downstream I will be extracting the map boundaries.
I am wondering if there is a method of classifying the pages in the PDF to find the maps.
Something like:
• Spacy layout to create a spacy doc object from a pdf
• Use the textcat.manual recipe to label each page as map or notmap.
The output might be something like
• Page 1, notmap
• Page 2, map
Do you think that would work?
Then, having found the maps, it's only really useful if I can automatically extract the pages with the maps. So the output is PDF pages with maps, which I can then process.
Any thoughts on the workflow for multipagePDF to PDFmapsOnly (or even better, one PDF per map)?
The final step is to extract the map boundaries; if you think there is anything in the spaCy/Prodigy universe that might be worth exploring, please let me know.
Here is an example document.
Hi @alphie,
If I understand correctly, your goal is to process PDFs as images only and there's no need to extract any text.
In that case, you shouldn't need spacy layout at all. spacy layout is mostly useful for extracting text from PDFs.
It sounds like you need an image classification model. You can prepare the training data with the pdf.image.manual recipe, where you can either 1) draw bounding boxes around maps to detect boundaries or 2) set it up to classify entire pages.
For 1) you can use pdf.image.manual as is.
For 2) you'd need to modify the recipe to use the classification view_id. That requires changing the view_id value to "classification" in the return statement on line 113 and setting the label on the example level in the generate_pdf_pages function:
def generate_pdf_pages(pdf_paths: List[Path], split_pages: bool = False):
    """Generate dictionaries that contain an image for each page in the PDF"""
    for pdf_path in pdf_paths:
        pdf = pdfium.PdfDocument(pdf_path)
        n_pages = len(pdf)
        pages = []
        for page_number in range(n_pages):
            pdf_page = pdf.get_page(page_number)
            page = {
                "image": page_to_image(pdf_page),
                "path": str(pdf_path),
                "meta": {
                    "title": pdf_path.name,
                    "page": page_number,
                },
                "label": "MAP"  # hardcoding the label for demonstration
            }
            if split_pages:
                yield set_hashes(page)
            else:
                page["view_id"] = "image_manual"
                pages.append(page)
        if not split_pages:
            yield set_hashes(
                {
                    "pages": pages,
                    "meta": {"title": pdf_path.name},
                    "config": {"view_id": "pages"},
                }
            )
        pdf.close()
When in classification mode, you want to use the --split-pages option to be able to annotate each page of the PDF.
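Once the pages are annotated this way, the exported annotations can be turned into page-level labels. Here is a minimal sketch, assuming the annotations were exported as JSONL (for example with prodigy db-out), that each record still carries the "meta" fields set in generate_pdf_pages, and that accepting a page with the hardcoded MAP label means "this page is a map". The function name page_labels and the "map"/"notmap" output values are only illustrative. Note that the page numbers coming from the recipe are 0-based.

```python
import json
from typing import Dict, Iterable, Tuple


def page_labels(jsonl_lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Map (pdf title, 0-based page number) to "map" or "notmap".

    Assumes one JSON record per line with "answer" and "meta" keys,
    as produced by the modified recipe above.
    """
    labels: Dict[Tuple[str, int], str] = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        eg = json.loads(line)
        answer = eg.get("answer")
        if answer not in ("accept", "reject"):
            continue  # skip ignored or unanswered pages
        key = (eg["meta"]["title"], eg["meta"]["page"])
        # accept on the MAP label -> map, reject -> notmap
        labels[key] = "map" if answer == "accept" else "notmap"
    return labels
```

You could call it with an open file handle, e.g. `page_labels(open("annotations.jsonl", encoding="utf8"))`, where the file name is a placeholder for your own export.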
As for the "multipagePDF to PDFmapsOnly" workflow - once you have a successful image classification model that should be just a question of data post-processing.
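That post-processing step could look roughly like the sketch below. It assumes your classifier gives you (page number, label) pairs with 0-based page numbers, and it uses the pypdf library to copy pages, which is my assumption here and not something the recipe itself uses. map_page_indices and write_map_pages are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def map_page_indices(
    predictions: Iterable[Tuple[int, str]], map_label: str = "map"
) -> List[int]:
    """Return the sorted, de-duplicated 0-based page numbers predicted as maps."""
    return sorted({page for page, label in predictions if label == map_label})


def write_map_pages(
    pdf_path: Path, pages: List[int], out_dir: Path, one_per_map: bool = True
) -> List[Path]:
    """Copy the given pages into new PDFs: one file per map, or one combined file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(str(pdf_path))
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    if one_per_map:
        for page in pages:
            writer = PdfWriter()
            writer.add_page(reader.pages[page])
            # use 1-based page numbers in file names for readability
            out_path = out_dir / f"{pdf_path.stem}_page{page + 1}.pdf"
            with open(out_path, "wb") as f:
                writer.write(f)
            written.append(out_path)
    elif pages:
        writer = PdfWriter()
        for page in pages:
            writer.add_page(reader.pages[page])
        out_path = out_dir / f"{pdf_path.stem}_maps.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        written.append(out_path)
    return written
```

With one_per_map=True you get one PDF per map page, and with one_per_map=False you get a single maps-only PDF.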
I recommend you check out our docs on PDF processing before starting out, to make sure you're familiar with all the options available.
The bottom line is that you need a computer vision solution rather than an NLP solution. So while you can definitely prepare your data using Prodigy, you should look at dedicated frameworks for the training phase, such as PyTorch or TensorFlow with Keras.
Thank you, that's really helpful