How to embed a PDF file in a recipe

Hello @ljvmiranda921, thanks a lot for responding! I forgot to add set_hashes to the input data before saving it to JSONL. After I added this line everything worked, so the problem is solved :slight_smile:

However, since I already have you here, I wanted to ask a quick question (I didn't want to open another topic, but I can move it if necessary): I want to embed PDF files into this scheme (to compare two PDF files rather than two strings) and pass a dict with relative paths to the PDF files as input data, like this:

{"pdf1": "file_1.pdf", "pdf2": "file_2.pdf"}
{"pdf1": "file_3.pdf", "pdf2": "file_4.pdf"}

and embed them in the scheme using, e.g.:

    html_template = """
    <embed src="{{pdf1}}" width="600px" height="2100px" />
    <br/><br/>
    <embed src="{{pdf2}}" width="600px" height="2100px" />
    """

When I use relative file paths for each PDF (the PDFs are currently in the same folder as the one from which I start Prodigy), I receive the following message instead of the PDF file:

{"detail":"Not Found"}

Additionally, when I use a full file path such as file:///full/file/path/file.pdf, nothing appears. However, when I uploaded my document to Google Drive and used a shareable link (with the link hard-coded in the recipe for both documents), everything displayed properly. It's quite baffling, as I thought changing strings to embedded PDFs would be quick and easy. Maybe there is something really silly I am missing, so I'd be thankful for a pointer in any direction :slight_smile:

Hi @ra-v :slight_smile: I moved your post to make it more searchable, anyway:

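Due to how most browsers work, specifying a local file URL alone isn't the best solution: a page served over HTTP is generally not allowed to load file:// resources, and a relative path is requested from the server hosting the page, which doesn't serve your PDFs. You have two options here: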

  1. Use a local web server to host your files. The easiest way to do this is with Python (for example, using the built-in http.server module), and then use the localhost URLs.
  2. Use a remote URL. One way is to store your files in an S3 or GCS bucket, ensure that the URL is accessible from your machine, and link to it from there.
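
As a rough sketch of option 1, the snippet below serves a folder of PDFs on localhost and builds URLs that could be placed in the input data. The helper names `serve_directory` and `to_local_url` and the port 8000 are assumptions made for illustration, not part of Prodigy; running `python -m http.server 8000` from the PDF folder should give a similar result without any code.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote


def serve_directory(directory, port=8000):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread.

    Hypothetical helper for illustration; port 8000 is an arbitrary choice.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def to_local_url(filename, port=8000):
    """Turn a file name relative to the served folder into a localhost URL."""
    return f"http://localhost:{port}/{quote(filename)}"
```

With the server running, an input line could then look like {"pdf1": "http://localhost:8000/file_1.pdf", "pdf2": "http://localhost:8000/file_2.pdf"}, assuming the PDFs sit directly in the served folder.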

Cf. choice of audios - #2 by ines

Thanks a lot for the solutions! I actually managed to solve it in yet another way: I read each file in Python, base64-encode it, and save the result in the JSONL file that I pass to the stream in recipe.py

import base64
# PDFs are binary, so the file has to be opened in "rb" mode for b64encode
with open("pdf_file.pdf", "rb") as f:
    pdf = base64.b64encode(f.read()).decode("utf-8")

This way I don't have to upload all files into any type of server :slight_smile:
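
For anyone following along, here is a minimal sketch of how the whole input file might be built this way. The function names `pdf_to_data_uri` and `write_pairs` are made up for illustration, and wrapping the base64 string in a `data:application/pdf;base64,` URI is an assumption about how the value ends up in the `src` attribute; whether a browser renders such a URI inside `<embed>` may depend on the browser and the file size.

```python
import base64
import json
from pathlib import Path


def pdf_to_data_uri(path):
    """Read a PDF as bytes and return it as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def write_pairs(pairs, out_path):
    """Write one JSON object per line, with both PDFs of each pair inlined."""
    with open(out_path, "w", encoding="utf-8") as out:
        for first, second in pairs:
            task = {
                "pdf1": pdf_to_data_uri(first),
                "pdf2": pdf_to_data_uri(second),
            }
            out.write(json.dumps(task) + "\n")
```

Calling write_pairs([("file_1.pdf", "file_2.pdf")], "input.jsonl") would produce one line per pair with the same pdf1 and pdf2 keys as before, so the template variables stay unchanged.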

Gotcha! For posterity, note that this solution works, but it may take up a lot of storage space, given that PDF files tend to be big and base64 encoding makes them roughly a third larger :smiley:

Yeah, that's true, I'll use your suggestions if the files turn out to be too heavy :smiley: