Image classification on prodigy-pdf

zkl · June 4, 2024, 4:45pm

Hello,

Is there any plan to add image classification to the prodigy-pdf plugin, where instead of drawing bounding boxes on elements of a page, the entire page is labeled with zero, one, or more categories?

Thanks!

magdaaniol · June 10, 2024, 9:58am

Hi @zkl,

We haven't planned on adding classification of entire PDF pages atm, but it's a feature you could very easily add yourself!
Remember that prodigy-pdf is open source, so you can check out the code from the repo and add the following recipe inside the __init__.py file:

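# Other names used below (recipe, Stream, msg, Path, generate_pdf_pages, ...)
# are expected to be available in the plugin's __init__.py already.
from typing import Dict, List

from prodigy.types import StreamType
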
@recipe(
    "pdf.imagecat",
    # fmt: off
    dataset=("Dataset to save answers to", "positional", None, str),
    pdf_folder=("Folder with PDFs to annotate", "positional", None, Path),
    remove_base64=("Remove base64-encoded image data", "flag", "R", bool)
    # fmt: on
)
def pdf_imagecat(
    dataset: str, pdf_folder: Path, remove_base64: bool = False
) -> ControllerComponentsDict:
    """Turns pdfs into images in order to annotate them."""
    # Read in stream as a list for progress bar.
    if not Path(pdf_folder).exists():
        msg.fail(f"Folder `{pdf_folder}` does not exist.", exits=True)
    pdf_paths = list(Path(pdf_folder).glob("*.pdf"))
    if len(pdf_paths) == 0:
        msg.fail("Did not find any .pdf files in folder.", exits=True)
    source = Stream.from_iterable(pdf_paths).apply(generate_pdf_pages)

    def before_db(examples):
        # Remove all data URIs before storing example in the database
        for eg in examples:
            if eg["image"].startswith("data:"):
                del eg["image"]
        return examples

    def add_options(stream: StreamType, options: List[Dict]) -> StreamType:
        for eg in stream:
            eg["options"] = options
            yield eg

    # define labels to be used for classification
    options = [
        {"id": 0, "text": "news"},
        {"id": 1, "text": "sport"},
        {"id": 2, "text": "business"},
        {"id": 3, "text": "science"},
        {"id": -1, "text": "other"},
    ]
    # add options to the task for `choice` UI
    stream = source.apply(add_options, stream=source, options=options)

    return {
        "dataset": dataset,
        "stream": stream,
        "before_db": before_db if remove_base64 else None,
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple" to allow several labels per page
            "choice_auto_accept": True,
        },
    }

Note this reuses the plugin's function for converting the PDFs into images, adds options with labels to the stream and renders everything in the choice UI.
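
To make the two helpers in the recipe a bit more concrete, here is a minimal sketch that runs them on plain dicts outside of Prodigy. The sample tasks and their "meta" values are made up for illustration and are not what the plugin actually emits; only the "image" and "options" keys come from the recipe above. The before_db variant here uses .get() so it doesn't fail on a task without an "image" key, which is a small defensive change compared to the recipe.

```python
from typing import Dict, Iterable, Iterator, List


def add_options(stream: Iterable[Dict], options: List[Dict]) -> Iterator[Dict]:
    # Attach the same list of choice options to every task in the stream.
    for eg in stream:
        eg["options"] = options
        yield eg


def before_db(examples: List[Dict]) -> List[Dict]:
    # Drop base64 data URIs so they are not stored in the database.
    for eg in examples:
        if eg.get("image", "").startswith("data:"):
            del eg["image"]
    return examples


# Hypothetical page tasks standing in for the plugin's output.
pages = [
    {"image": "data:image/png;base64,AAAA", "meta": {"page": 0}},
    {"image": "data:image/png;base64,BBBB", "meta": {"page": 1}},
]
options = [{"id": 0, "text": "news"}, {"id": -1, "text": "other"}]

tasks = list(add_options(pages, options))
stored = before_db(tasks)
print(stored[0])
```

Each task keeps its options and metadata, while the image data is removed before saving, which is what the -R flag switches on.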
Once you have the recipe inside the __init__.py file, you just need to reinstall the plugin in the target virtual environment. So, with the target virtual environment activated and from inside the checked-out prodigy-pdf folder:

python -m pip install -e .

The added recipe pdf.imagecat should now be ready to use with Prodigy like so:

python -m prodigy pdf.imagecat test pdfs -R
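
Once you've annotated some pages, you can export the dataset (for example with python -m prodigy db-out test > annotations.jsonl) and map the selected option ids back to label names. Below is a rough sketch of that step. It assumes the choice UI stores the selected ids under "accept" and the annotator's decision under "answer"; the sample records are invented for illustration, and the id-to-text mapping just mirrors the options list from the recipe.

```python
import json
from typing import Dict, List

# Mirrors the `options` defined in the recipe above.
ID_TO_TEXT = {0: "news", 1: "sport", 2: "business", 3: "science", -1: "other"}


def accepted_labels(record: Dict) -> List[str]:
    # Only count tasks the annotator accepted; skipped or rejected
    # pages yield no labels.
    if record.get("answer") != "accept":
        return []
    return [ID_TO_TEXT[i] for i in record.get("accept", [])]


# Hypothetical exported lines; real records contain more fields.
sample_lines = [
    '{"meta": {"page": 0}, "accept": [1], "answer": "accept"}',
    '{"meta": {"page": 1}, "accept": [], "answer": "ignore"}',
]
labels = [accepted_labels(json.loads(line)) for line in sample_lines]
print(labels)
```

If you switch "choice_style" to "multiple", the "accept" list can hold several ids (or none), so the same function would cover the zero, one, or more categories case.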
