Taking a Computer Vision Approach (leveraging image.manual) to build a custom NER model on PDFs

koaning · July 28, 2022, 9:27am

Hi Jetson, here are the answers to your questions:

Yes, correct. Prodigy doesn't natively support PDFs. However, you can choose to write your own custom recipe that is able to use them. That way, you could consider using a Python package that can parse .pdf files. You might be able to consider this if your pdfs follow a very strict structure, but the image OCR path seems like a more common approach.
An annotated image will have bounding boxes with the data format described here.
You may appreciate this answer if you want to auto-tag images.

Topic		Replies	Views
Medical PDF model ner , project	2	269	October 27, 2023
Adding a helper image textcat , custom , front-end	4	418	November 10, 2022
Clarifying Questions Regarding Auto-labeling (Active Learning, Ner.teach, Multiple Occurrences)	3	361	July 29, 2022
Image segmentation (bounding boxes) for textual images image	9	2908	March 29, 2021
finding areas on pdfs for downstream training	3	91	July 19, 2024