Problem with path in pdf.image.manual

Hi,

I'm trying to initiate annotation using pdf.image.manual, but it's complaining about the following.

prod/lib/python3.10/site-packages/prodigy_pdf/__init__.py", line 58, in pdf_image_manual
    if not pdf_folder.exists():
AttributeError: 'str' object has no attribute 'exists'

From what I can understand, it has to do with the path to the PDF, but it is correct. I've tried both absolute and relative paths. What should I do?

It runs inside a virtual environment. I'm running the prodigy command from the main folder, and the PDF folder is a child of the main one.

Here is the command:

prodigy pdf.image.manual datasetname pdf_folder --labels 1,2,3,4

This was totally a bug on our end, sorry about that! I'm a bit surprised this one slipped through CI. I just made a quick patch and will check CI right away. You should be able to uninstall/install and this issue should be gone.

The problem was that we didn't parse the path to a Path object. Notice the change at the end of the pdf_folder line: the argument type went from str to Path.

before

@recipe(
    "pdf.image.manual",
    # fmt: off
    dataset=("Dataset to save answers to", "positional", None, str),
    pdf_folder=("Folder with PDFs to annotate", "positional", None, str),
    labels=("Comma seperated labels to use", "option", "l", str),
    remove_base64=("Remove base64-encoded image data", "flag", "R", bool)
    # fmt: on
)
def pdf_image_manual(
    dataset: str,
    pdf_folder: Path,
    labels:str,
    remove_base64:bool=False
) -> ControllerComponentsDict:

now

@recipe(
    "pdf.image.manual",
    # fmt: off
    dataset=("Dataset to save answers to", "positional", None, str),
    pdf_folder=("Folder with PDFs to annotate", "positional", None, Path),
    labels=("Comma seperated labels to use", "option", "l", str),
    remove_base64=("Remove base64-encoded image data", "flag", "R", bool)
    # fmt: on
)
def pdf_image_manual(
    dataset: str,
    pdf_folder: Path,
    labels:str,
    remove_base64:bool=False
) -> ControllerComponentsDict:
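
If you hit the same AttributeError in a custom recipe, the sketch below reproduces the failure and the fix in isolation. The helper name folder_exists is hypothetical and not part of prodigy_pdf; it only illustrates that a type annotation alone does not convert a command-line string, so the value has to be turned into a Path explicitly.

```python
from pathlib import Path
from typing import Union


def folder_exists(pdf_folder: Union[str, Path]) -> bool:
    """Hypothetical helper: check a folder given as either str or Path."""
    # Command-line arguments arrive as plain strings. Calling
    # pdf_folder.exists() on a str raises AttributeError, so coerce first.
    return Path(pdf_folder).exists()


# A plain string has no .exists() method, which is what the traceback showed.
print(hasattr("pdf_folder", "exists"))
# After coercion, both input types behave the same.
print(folder_exists("."), folder_exists(Path(".")))
```

Coercing inside the function is a defensive alternative; the patch above instead lets the recipe decorator do the conversion by declaring the argument as Path.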

If there are any other issues, do let me know!

Thank you! I'm able to annotate now. After saving the dataset, I'm having issues with pdf.ocr.correct.
Here is my error when trying to start a round of manual correction.

prodigy pdf.ocr.correct test pdf_anno --labels text --fold-dashes

Provided source: 'pdf_anno' was resolved as a Path but does not exist. If
this is not a Path, try specifying an explicit loader. Otherwise, ensure the
Path exists.

For some reason it's recognized as a path instead of a dataset source?

Ah yeah, you'll want to do

prodigy pdf.ocr.correct test dataset:pdf_anno --labels text --fold-dashes

If you don't add the dataset: prefix, Prodigy will assume you're referring to a file on disk. Does this help?
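
To make the rule concrete, here is a rough sketch of how a source argument could be told apart. This is not Prodigy's actual implementation; classify_source is a hypothetical name, and the only behaviour taken from this thread is that a dataset: prefix means "existing dataset" and anything else is treated as a path on disk.

```python
from typing import Tuple

DATASET_PREFIX = "dataset:"


def classify_source(source: str) -> Tuple[str, str]:
    """Hypothetical helper: return (kind, name) for a source argument."""
    if source.startswith(DATASET_PREFIX):
        # Strip the prefix to get the dataset name.
        return ("dataset", source[len(DATASET_PREFIX):])
    # Without the prefix, the value is interpreted as a file or folder path.
    return ("path", source)


print(classify_source("pdf_anno"))          # treated as a path
print(classify_source("dataset:pdf_anno"))  # treated as a dataset
```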

Thank you! It's working as expected now! :blush:
One final question on the topic. When annotating non-English characters, it does not recognize them and they get replaced.
å = a
ä = a

What can I do about it?

This may be a Tesseract issue, which we can't perfectly tune from our side. That said, here are a few ideas.

  1. You can upscale the image some more. That might help. To do this, you'd want to use the --scale flag documented here.
  2. Alternatively, you can also copy the code from GitHub and make changes to it. It seems there are downloadable language packs here, which are described in this tutorial. I'm a bit out of my comfort zone making strong recommendations here, but it feels like a direction that could help.
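
As a starting point for the second idea, this is a minimal sketch of calling the Tesseract command-line tool with an explicit language pack, outside of the recipe. It assumes the tesseract executable and the relevant language data are installed; the "swe" (Swedish) code is only a guess based on the å and ä characters, and the function names and page.png are hypothetical. It does not show how the recipe itself calls Tesseract.

```python
import shutil
import subprocess
from typing import List


def build_tesseract_command(image_path: str, lang: str = "swe") -> List[str]:
    """Hypothetical helper: build a Tesseract CLI call for one image."""
    # "stdout" prints the recognized text instead of writing a file;
    # "-l" selects the language pack (assumed "swe" here).
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path: str, lang: str = "swe") -> str:
    """Run Tesseract on an image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_tesseract_command("page.png"))
```

Running tesseract --list-langs shows which language packs are available locally, which is a quick way to confirm the pack is installed before comparing OCR output with and without it.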

If you stumble on a solution, I'd love to be kept in the loop as it may serve as an inspiration to improve the recipe.