Grouping images for PDF file annotation

Dear spaCy/Prodigy team,

First of all, thanks a lot for the great products. I really love the simplicity and, at the same time, the extensibility of your tools.

I would like to annotate PDF files by annotating the images of the converted PDF pages.
A PDF can have up to 30 pages, and thus 30 images.

Is there a way to group the images so that annotation can be done group by group?
E.g. pdf_file1: image_01, image_02, image_03, ...

Something like keeping the images grouped in the history pane on the left side, so that an annotator can easily go back and forth within the group.

Or is it possible to generate a dataset for each PDF file, consisting of the images in that file, and then start annotating multiple datasets?

Thanks for your feedback.

Hi and thanks :smiley:

I think there are several ways to make this work and it depends on the details of your use case. When you load in your files, do they show up in alphabetical order? (Under the hood, Prodigy calls path.iterdir, which I thought was alphabetic – but turns out that depends on your operating system).

Instead of loading in a directory of images, you could also load in a JSONL file (and set --loader jsonl when you start image.manual). If an image task specifies a "text", that's going to be used in the history pane in the sidebar. So you could use that to give the history entries more readable names, and make sure the files are presented in the right order:

{"image": "group_01/image_01.jpg", "text": "G1: 01"}
{"image": "group_01/image_02.jpg", "text": "G1: 02"}
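A file like this can be generated with a few lines of Python. The following is only a minimal sketch: it assumes one sub-folder per PDF (for example group_01, group_02) with zero-padded .jpg page names, and the helper names build_tasks and write_jsonl are made up for illustration, they are not part of Prodigy. The explicit sorted() calls are there so the order no longer depends on what the operating system returns.

```python
import json
from pathlib import Path


def build_tasks(root):
    """Yield one image task per page, ordered by group folder, then file name.

    Hypothetical helper: assumes one sub-folder per PDF under `root`.
    """
    root = Path(root)
    # Sort explicitly, because Path.iterdir() makes no ordering guarantee.
    group_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    for group_number, group_dir in enumerate(group_dirs, start=1):
        # Lexicographic sort: page names need zero padding (image_02, not image_2).
        pages = sorted(group_dir.glob("*.jpg"))
        for page_number, page in enumerate(pages, start=1):
            yield {
                "image": page.as_posix(),
                # Short label for the history pane, e.g. "G1: 02".
                "text": f"G{group_number}: {page_number:02d}",
            }


def write_jsonl(tasks, out_path):
    """Write the tasks to a JSONL file, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
```

Calling write_jsonl(build_tasks("pages"), "pages.jsonl") would then produce a file that can be loaded with --loader jsonl, where "pages" stands in for whatever folder holds the converted images.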

You can also include other metadata in the tasks – e.g. "group": 1. This will be passed through with the data and saved in the database (and you can later use that to more easily filter your annotations).
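To illustrate the filtering part: assuming the annotations were later exported to a JSONL file (for example with the db-out command) and every task carried a "group" key, a rough sketch for collecting the examples per group could look like this. The function name group_annotations is hypothetical and the code uses only the standard library.

```python
import json
from collections import defaultdict


def group_annotations(jsonl_lines):
    """Collect exported examples by their "group" value.

    Hypothetical helper: expects an iterable of JSONL lines, e.g. an open file.
    Examples without a "group" key end up under the key None.
    """
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        groups[example.get("group")].append(example)
    return dict(groups)
```

Passing an open file handle of the export to group_annotations gives a dict whose keys are the group values, so all pages of one PDF can be reviewed or processed together.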

One thing to note about the history is that it will always show you the 10 most recently edited examples (which makes sense). So if an annotator clicks on an older annotation to change it, it will show up on top, because it was most recently edited. That means you can't rely on the history always showing the original group order.

Hi Ines,
Thanks a lot for your reply, that already helps a lot.

Would it also be possible to generate a dataset for each group of images, e.g.

dataset_pdf1

  • image-01
  • image-02
  • image-03

dataset_pdf2

  • image-01
  • image-02
  • image-03

and then start Prodigy with multiple datasets?

Is there a way to keep the history on the left in the same order as the JSONL stream, and to keep more than 10 files in there?

I understand that multi-image labeling (like for PDF documents) was probably not in scope, but maybe I can code it in with some hints.

There are so many use cases when it comes to document information extraction, and Prodigy is far better than most other tools I tried out.

By the way, you and Matthew have great hair! :grinning:

Datasets in Prodigy are what annotations are saved to – to load in data, you just stream it in from a directory, a file or a Python script. So you can set up your data like I described above and stream in page by page from multiple PDFs in order, however you like. That's a pretty standard use case.

I'm not 100% sure I understand what the problem is or what's still missing. If you stream in your pages in order, the annotator can work on them and if they need to go back to a previous page, they can go back, check something or make a correction, and then move on.

The history shows the most recently edited or created annotations, in chronological order. This is typically what's expected from the annotation history: if it wasn't showing you the annotations in "historical" order, it would be pretty confusing. The items kept in the history are the examples that haven't yet been sent back to the server, so making the history longer would mean delaying sending examples back (see my comment here for details).

Thanks for the support.

Using the "text" key to provide the name is already very useful for the user to understand the ordering.
