Image classification on prodigy-pdf


Are there any plans to add image classification to the prodigy-pdf plugin, where instead of drawing bounding boxes on elements of a page, the entire page is labeled with zero, one, or more categories?


Hi @zkl,

We haven't planned on adding classification of entire PDF pages atm, but it's a feature you could very easily add yourself!
Remember that prodigy-pdf is open source, so you can check out the code from the repo and add the following recipe inside the file:

from typing import Dict, List

from prodigy.types import StreamType


@recipe(
    "pdf.imagecat",
    # fmt: off
    dataset=("Dataset to save answers to", "positional", None, str),
    pdf_folder=("Folder with PDFs to annotate", "positional", None, Path),
    remove_base64=("Remove base64-encoded image data", "flag", "R", bool),
    # fmt: on
)
def pdf_imagecat(
    dataset: str, pdf_folder: Path, remove_base64: bool = False
) -> ControllerComponentsDict:
    """Turns pdfs into images in order to annotate them."""
    # Read in stream as a list for progress bar.
    if not Path(pdf_folder).exists():
        msg.fail(f"Folder `{pdf_folder}` does not exist.", exits=True)
    pdf_paths = list(Path(pdf_folder).glob("*.pdf"))
    if len(pdf_paths) == 0:
        msg.warn("Did not find any .pdf files in folder.")
    source = Stream.from_iterable(pdf_paths).apply(generate_pdf_pages)

    def before_db(examples):
        # Remove all data URIs before storing example in the database
        for eg in examples:
            if eg["image"].startswith("data:"):
                del eg["image"]
        return examples

    def add_options(stream: StreamType, options: List[Dict]) -> StreamType:
        for eg in stream:
            eg["options"] = options
            yield eg

    # define labels to be used for classification
    options = [
        {"id": 0, "text": "news"},
        {"id": 1, "text": "sport"},
        {"id": 2, "text": "business"},
        {"id": 3, "text": "science"},
        {"id": -1, "text": "other"},
    ]
    # add options to the task for `choice` UI
    stream = source.apply(add_options, stream=source, options=options)

    return {
        "dataset": dataset,
        "stream": stream,
        "before_db": before_db if remove_base64 else None,
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple" to allow several labels per page
            "choice_auto_accept": True,
        },
    }

Note that this reuses the plugin's function for converting the PDFs into images, adds the label options to every task in the stream, and renders everything in the choice UI.
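To make the data flow a bit more concrete, here is a rough sketch that runs the two helpers from the recipe on plain dictionaries, without Prodigy installed. The tasks in `fake_tasks` (including the `meta` field and the shortened data URIs) are made up for illustration; the real tasks come from the plugin's `generate_pdf_pages` and may carry different fields. The `before_db` here uses `eg.get(...)` so it also tolerates tasks without an image, which is a small deviation from the recipe above.

```python
from typing import Dict, Iterable, Iterator, List

# Shortened copy of the label options used in the recipe.
OPTIONS = [
    {"id": 0, "text": "news"},
    {"id": 1, "text": "sport"},
    {"id": -1, "text": "other"},
]


def add_options(stream: Iterable[Dict], options: List[Dict]) -> Iterator[Dict]:
    """Attach the same list of choice options to every task."""
    for eg in stream:
        eg["options"] = options
        yield eg


def before_db(examples: List[Dict]) -> List[Dict]:
    """Drop inline image data (data URIs) before the answers are saved."""
    for eg in examples:
        if eg.get("image", "").startswith("data:"):
            del eg["image"]
    return examples


# Hypothetical tasks standing in for the output of generate_pdf_pages.
fake_tasks = [
    {"image": "data:image/png;base64,AAAA", "meta": {"page": 0}},
    {"image": "data:image/png;base64,BBBB", "meta": {"page": 1}},
]

# What the choice UI receives: image plus options.
tasks = list(add_options(fake_tasks, OPTIONS))

# What ends up in the database when the -R flag is set: no image data.
stored = before_db([dict(task) for task in tasks])

print(sorted(tasks[0].keys()))
print(sorted(stored[0].keys()))
```

The point of the split is that the image has to be present while annotating, but storing base64 data for every page can make the dataset large, so it is only stripped at save time.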
Once you have the recipe inside the file, you just need to reinstall the plugin in the target virtual environment. So, with the target virtual environment activated and inside the checked-out prodigy-pdf folder:

python -m pip install -e .

The added recipe pdf.imagecat should now be ready to use with Prodigy like so:

python -m prodigy pdf.imagecat test pdfs -R
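Once you have some annotations, the choice UI stores the IDs of the selected options under the "accept" key of each saved example, so a page can end up with zero, one, or several labels depending on the choice_style. Below is a small sketch of how you could turn an export (for example from python -m prodigy db-out test) back into label names. The JSONL lines in `exported` are invented for illustration and only contain the fields the sketch needs; `page_labels` is a hypothetical helper, not part of Prodigy or the plugin.

```python
import json
from collections import Counter
from typing import List

# Same ID-to-label mapping as the options defined in the recipe.
ID_TO_LABEL = {0: "news", 1: "sport", 2: "business", 3: "science", -1: "other"}

# Invented sample lines, shaped like a JSONL export of the dataset.
exported = """\
{"meta": {"page": 0}, "accept": [0], "answer": "accept"}
{"meta": {"page": 1}, "accept": [1, 2], "answer": "accept"}
{"meta": {"page": 2}, "accept": [], "answer": "accept"}
{"meta": {"page": 3}, "accept": [3], "answer": "ignore"}
"""


def page_labels(jsonl_text: str) -> List[List[str]]:
    """Return the label names per accepted page, skipping ignored/rejected ones."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        rows.append([ID_TO_LABEL[i] for i in eg.get("accept", [])])
    return rows


labels = page_labels(exported)
counts = Counter(label for row in labels for label in row)

print(labels)
print(counts)
```

Pages that were ignored are skipped here, while an accepted page with an empty "accept" list is kept as a page with no category.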