Somewhat related, we've had a similar post on invoice parsing:
I'd recommend reviewing LJ's project where he uses a HuggingFace model that considers both text and image.
A good first step is to reproduce the project as-is: clone the repo, set up the requirements (and install tesseract), and check that it runs. Once that works, you can modify the project and swap out the original data for your own. This is a bit of an advanced project, as the task you're doing can be tricky. If you reproduce the project, you may also want to use Prodigy v1.11.14 (not v1.12), since the changes we made to streams in v1.12 could break it.
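If it helps, here's a rough pre-flight check you could run before trying to reproduce the project. It's only a sketch: it assumes `tesseract` should be on your `PATH` and that Prodigy is installed under the distribution name `prodigy`. The function names are made up for illustration and aren't part of the project or of Prodigy.

```python
import re
import shutil
from importlib import metadata


def parse_version(version):
    """Turn a string like '1.11.14' into a tuple like (1, 11, 14)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def prodigy_is_pre_1_12(version):
    """True if the version is older than v1.12 (for example v1.11.14)."""
    return parse_version(version) < (1, 12)


def check_environment():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Assumption: the project calls the tesseract binary, so it must be on PATH.
    if shutil.which("tesseract") is None:
        problems.append("tesseract was not found on PATH")

    # Assumption: Prodigy is installed under the distribution name 'prodigy'.
    try:
        installed = metadata.version("prodigy")
    except metadata.PackageNotFoundError:
        problems.append("prodigy is not installed in this environment")
    else:
        if not prodigy_is_pre_1_12(installed):
            problems.append(
                f"prodigy {installed} is installed; v1.11.14 may be safer for this project"
            )

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment looks ready.")
```

Run it inside the same virtual environment you set up for the project, so it sees the same packages the project will use.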
Hope this helps!