Hello,
Is there any plan to add image classification to the prodigy-pdf plugin, where instead of drawing bounding boxes on elements of a page, the entire page is labeled with zero, one, or more categories?
Thanks!
Hi @zkl,
We don't plan to add classification of entire PDF pages at the moment, but it's a feature you could very easily add yourself!
Remember that prodigy-pdf is open source, so you can check out the code from the repo and add the following recipe to its __init__.py file:
from typing import Dict, List

from prodigy.types import StreamType


@recipe(
    "pdf.imagecat",
    # fmt: off
    dataset=("Dataset to save answers to", "positional", None, str),
    pdf_folder=("Folder with PDFs to annotate", "positional", None, Path),
    remove_base64=("Remove base64-encoded image data", "flag", "R", bool),
    # fmt: on
)
def pdf_imagecat(
    dataset: str, pdf_folder: Path, remove_base64: bool = False
) -> ControllerComponentsDict:
    """Turn PDF pages into images and label each whole page with a category."""
    # Read in stream as a list for progress bar.
    if not Path(pdf_folder).exists():
        msg.fail(f"Folder `{pdf_folder}` does not exist.", exits=True)
    pdf_paths = list(Path(pdf_folder).glob("*.pdf"))
    if len(pdf_paths) == 0:
        # Stop here too: there is nothing to annotate without any PDFs.
        msg.fail("Did not find any .pdf files in folder.", exits=True)

    # Reuse the plugin's generate_pdf_pages to turn each PDF page into an image task.
    source = Stream.from_iterable(pdf_paths).apply(generate_pdf_pages)

    def before_db(examples):
        # Remove all data URIs before storing example in the database
        for eg in examples:
            if eg["image"].startswith("data:"):
                del eg["image"]
        return examples

    def add_options(stream: StreamType, options: List[Dict]) -> StreamType:
        # Attach the same label options to every page task.
        for eg in stream:
            eg["options"] = options
            yield eg

    # define labels to be used for classification
    options = [
        {"id": 0, "text": "news"},
        {"id": 1, "text": "sport"},
        {"id": 2, "text": "business"},
        {"id": 3, "text": "science"},
        {"id": -1, "text": "other"},
    ]
    # add options to the task for `choice` UI
    stream = source.apply(add_options, stream=source, options=options)

    return {
        "dataset": dataset,
        "stream": stream,
        "before_db": before_db if remove_base64 else None,
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple"
            "choice_auto_accept": True,
        },
    }
Note that this reuses the plugin's function for converting the PDFs into images, adds options with labels to the stream, and renders everything in the choice UI.
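To get a feel for what each task looks like by the time it reaches the `choice` UI, here is a minimal sketch that runs the recipe's two helpers on plain dictionaries, without Prodigy installed. The sample task, its placeholder `data:` URI and the `meta` contents are made up for illustration; the real tasks come from the plugin's `generate_pdf_pages` and may carry other keys.

```python
from typing import Dict, Iterable, Iterator, List


def add_options(stream: Iterable[Dict], options: List[Dict]) -> Iterator[Dict]:
    """Attach the same list of choice options to every task."""
    for eg in stream:
        eg["options"] = options
        yield eg


def before_db(examples: List[Dict]) -> List[Dict]:
    """Drop inline base64 image data so it is not stored in the database."""
    for eg in examples:
        # .get() makes the sketch tolerant of tasks without an "image" key.
        if eg.get("image", "").startswith("data:"):
            del eg["image"]
    return examples


# Hypothetical task: one PDF page rendered as a data URI (placeholder payload).
fake_task = {"image": "data:image/png;base64,AAAA", "meta": {"page": 0}}
options = [{"id": 0, "text": "news"}, {"id": 1, "text": "sport"}]

tasks = list(add_options([fake_task], options))  # what the UI would receive
stored = before_db(tasks)                        # what would be saved with -R
```

After `add_options` every task carries the label list under `"options"`, and after `before_db` the bulky image data is gone while the rest of the task is kept.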
Once the recipe is in the __init__.py file, just reinstall the plugin into the target virtual environment. With that environment activated, run the following from inside the checked-out prodigy-pdf folder:
python -m pip install -e .
The added pdf.imagecat recipe should now be ready to use with Prodigy like so:
python -m prodigy pdf.imagecat test pdfs -R
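Here `test` is the dataset name, `pdfs` is the folder with the PDFs, and `-R` turns on `remove_base64`. Once some pages are annotated, the answers can be exported with `python -m prodigy db-out test > annotations.jsonl`. As a rough sketch of what to do with that file, the snippet below tallies the selected labels. It assumes the usual `choice` output, where each saved task has an `answer` field and an `accept` list holding the selected option ids, and it reuses the example labels from the recipe. The `count_labels` helper and the sample records are purely illustrative.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Same id-to-label mapping as the `options` list in the recipe.
LABELS: Dict[int, str] = {0: "news", 1: "sport", 2: "business", 3: "science", -1: "other"}


def count_labels(lines: Iterable[str]) -> Counter:
    """Count the selected labels across accepted tasks in JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored pages
        for option_id in eg.get("accept", []):
            counts[LABELS.get(option_id, str(option_id))] += 1
    return counts


# Made-up records standing in for lines of the exported file.
sample = [
    '{"answer": "accept", "accept": [0]}',
    '{"answer": "accept", "accept": [2]}',
    '{"answer": "ignore", "accept": []}',
    '{"answer": "accept", "accept": [0]}',
]
counts = count_labels(sample)
```

With a real export you would pass the open file instead of `sample`, since a file object also iterates line by line. If you switch `"choice_style"` to `"multiple"`, `accept` can hold several ids per page and the loop counts each of them.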