How to embed a PDF file to a recipe

ra-v · December 17, 2021, 8:55am

Hello @ljvmiranda921, thanks a lot for responding! I forgot to apply set_hashes to the input data before saving it to JSONL. After I added that line everything worked, so the problem is solved.

However, since I already have you here, I wanted to ask a quick question (I didn't want to spawn another topic, but I can move it if necessary). I want to embed PDF files in this scheme (to compare two PDF files rather than two strings) and pass, as input data, dicts with relative paths to the PDF files, like this:

{"pdf1": "file_1.pdf", "pdf2": "file_2.pdf"}
{"pdf1": "file_3.pdf", "pdf2": "file_4.pdf"}

and embed them in the scheme using, e.g.:

    html_template = """
    <embed src="{{pdf1}}" width="600px" height="2100px" />
    <br/><br/>
    <embed src="{{pdf2}}" width="600px" height="2100px" />
    """

After using relative file paths for each PDF (the PDFs are currently in the same folder as the one I start Prodigy from), I receive the following message instead of the PDF file:

{"detail":"Not Found"}

Additionally, when I use the full file path in the form file:///full/file/path/file.pdf, nothing appears. On the other hand, when I uploaded my documents to Google Drive and used the shareable links (hard-coded in the recipe for both documents), everything displayed properly. It's quite baffling, as I thought changing strings to embedded PDFs would be quick and easy. Maybe there is something really silly I am missing, so I'll be thankful for a pointer in any direction.

ljvmiranda921 · December 17, 2021, 2:38pm

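Hi @ra-v, I moved your post to make it more searchable. Anyway:

Most browsers refuse to load file:// URLs from a page that is served over HTTP, so specifying a local file URL alone isn't the best solution. A relative path doesn't help either: the browser resolves it against the Prodigy web app, which doesn't serve files from your working directory, hence the "Not Found" response. You have two options here: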

Use a local web server to host your files. The easiest way to do this is with Python (for example, the built-in http.server module), then reference the files through their localhost URLs.
Use a remote URL. One way is to serve your files from an S3 or GCS bucket, make sure the URL is accessible from your machine, and link to it from there.
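As a rough sketch of the first option, assuming the PDFs sit in one folder and the input tasks look like the {"pdf1": ..., "pdf2": ...} dicts from the question: the helper names below (serve_directory, to_localhost_url, add_urls) and port 8000 are made up for illustration and are not part of Prodigy. Pick a port other than the one Prodigy itself runs on.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP in a background thread and return the server."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def to_localhost_url(filename, port=8000):
    """Turn a file name relative to the served folder into a localhost URL."""
    return f"http://localhost:{port}/{quote(filename)}"


def add_urls(task, port=8000):
    """Replace the relative paths in one task dict with localhost URLs."""
    return {key: to_localhost_url(path, port) for key, path in task.items()}
```

Running python -m http.server 8000 from the folder with the PDFs does the same job as serve_directory from the command line. The stream would then yield add_urls({"pdf1": "file_1.pdf", "pdf2": "file_2.pdf"}) instead of the bare relative paths, so the template's src attributes point at the local file server.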

C.f. choice of audios - #2 by ines

ra-v · December 20, 2021, 10:48am

Thanks a lot for the solutions! I actually managed to solve it in yet another way: I read the file in Python, base64-encode it, and save the result in the JSONL file that I pass to the stream in recipe.py

import base64

with open('pdf_file.pdf', "rb") as f:  # binary mode: b64encode needs bytes, not str
    pdf = base64.b64encode(f.read()).decode('utf-8')

This way I don't have to upload the files to any kind of server.
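The post doesn't show how the encoded string ends up in the template. One possibility, sketched here as an assumption, is to wrap it in a data: URI so that it can go straight into the src attribute of the embed tag. The function names and the pdf1/pdf2 keys mirror the earlier example and are otherwise hypothetical.

```python
import base64
import json


def pdf_to_data_uri(path):
    """Read a PDF as bytes and return it as a base64 data URI."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:application/pdf;base64,{encoded}"


def write_tasks(pairs, out_path):
    """Write one JSON line per (first, second) pair of PDF file paths."""
    with open(out_path, "w", encoding="utf-8") as out:
        for first, second in pairs:
            task = {
                "pdf1": pdf_to_data_uri(first),
                "pdf2": pdf_to_data_uri(second),
            }
            out.write(json.dumps(task) + "\n")
```

For example, write_tasks([("file_1.pdf", "file_2.pdf")], "tasks.jsonl") would produce a JSONL file whose pdf1 and pdf2 values can be dropped into the template unchanged.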

ljvmiranda921 · December 20, 2021, 11:37am

Gotcha! For posterity, note that this solution works but may impact storage space: PDF files tend to be big, and base64 encoding makes them about a third larger.
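To put a rough number on that: base64 turns every 3 input bytes into 4 output characters (with padding), so the encoded size can be estimated before writing anything to disk. The helper name below is just for illustration.

```python
import base64


def base64_size(n_bytes):
    """Length of the base64 encoding of n_bytes bytes, padding included."""
    return 4 * ((n_bytes + 2) // 3)


# Sanity check against the real encoder on a small payload.
assert base64_size(10) == len(base64.b64encode(b"x" * 10))
```

So a 3 MB PDF takes about 4 MB once encoded, and with two PDFs per task that overhead is paid twice per line of the JSONL file.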

ra-v · December 20, 2021, 11:39am

Yeah, that's true. I'll use your suggestions if the files turn out to be too heavy.
