Page Classification of PDF Documents

glander · January 12, 2019, 8:44pm

Hello Prodigy Community,

I’m considering purchasing Prodigy for a client that would like to classify pages of PDF documents into about 30 classes. The model would operate on the string content of each page, with some other derived features to help it (e.g., pixel saturation by page section). They have plenty of training data and a reliable number of labels for half of those classes; the remaining classes will have to be manually annotated by some interns. I’m not sure how good a fit Prodigy is for this and am trying to evaluate it.

I can use MuPDF and PyMuPDF to turn pages into images and do the classification through Prodigy’s image classification, but I’m not sure how much customizability that offers beyond the Multiple Choice (Image) live demo. It also seems like a lot of work to implement, as I’ll have to create not only a pipeline for forming the images, but also one for tracking which strings correspond to those annotations when I feed them back to my TF model.
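For illustration, here is a minimal sketch of what such a pipeline might look like, not a tested Prodigy integration. It assumes a recent PyMuPDF (for `get_pixmap(dpi=...)` and `tobytes("png")`), and it assumes the annotation tool takes choice-style tasks with `image` and `options` keys, returns the selected option ids under `accept`, and passes extra keys through untouched. The helper names and the `page_text` key are hypothetical. Storing the page string inside each task is one way to avoid a separate image-to-string lookup.

```python
import base64
import json


def render_pages(pdf_path, dpi=72):
    """Yield (page_number, page_text, png_bytes) for each page of a PDF.

    PyMuPDF is imported lazily so the helpers below work without it.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc):
            pix = page.get_pixmap(dpi=dpi)
            yield page_number, page.get_text(), pix.tobytes("png")


def make_task(doc_id, page_number, page_text, png_bytes, labels):
    """Build one annotation task that carries the page image and its text."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "image": "data:image/png;base64," + encoded,
        # Hypothetical key: keeps the string next to the image it came from.
        "page_text": page_text,
        "options": [{"id": label, "text": label} for label in labels],
        "meta": {"doc": doc_id, "page": page_number},
    }


def to_training_pairs(annotations):
    """Turn accepted annotations into (page_text, label) pairs for a model."""
    pairs = []
    for eg in annotations:
        # Assumption: skipped or rejected tasks are not marked "accept".
        if eg.get("answer") != "accept":
            continue
        for label in eg.get("accept", []):
            pairs.append((eg["page_text"], label))
    return pairs


def write_jsonl(tasks, out_path):
    """Write tasks as newline-delimited JSON, one task per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
```

With something like this, `write_jsonl(make_task("a.pdf", n, text, png, labels) for n, text, png in render_pages("a.pdf"))` would produce the input file, and `to_training_pairs` would recover the strings and labels from the exported annotations.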

So I guess the question I’m asking is: is my project the right use case for Prodigy, or would I be better off just making my own ad hoc annotation client in something like PySimpleGUI?

Thank you for your input, and I apologize if this question has been asked before. The closest I could find were the two threads linked below, and neither answered my question.

Web page classification
Prodigy with PDFs

honnibal · January 14, 2019, 4:39pm

Hi @glander,

In general I often find myself encouraging people to consider self-implemented alternatives. We’re definitely of the belief that one software package can’t do everything — if it could implement any solution, Prodigy would be a language, not a library!

I think what you’re trying to do is a bit outside the core use case, so it’s possible it’ll be better to start fresh. That said, Prodigy’s pretty easy to customise, so if you want a web-based UI, the management of the database/REST/CLI interaction that Prodigy provides might be useful enough on its own to make it worth working with.

If you want to try out the back-end, I could set up a demo VM for you — send us an email at contact@explosion.ai

Best,
Matt
