Bug: IndexError from pdf.layout.fetch recipe

Hello,

I'm trying to run the pdf.layout.fetch recipe to process a directory of PDFs, but during the run I see:

Traceback (most recent call last):
[...omitted...]
136, in get_stream
    yield from self.get_full_stream()
  File "/Users/budak/.pyenv/versions/ezdeposit/lib/python3.12/site-packages/prodigy_pdf/spans.py", line 153, in get_full_stream
    doc[page_spans[0].start : page_spans[-1].end],
        ~~~~~~~~~~^^^
IndexError: list index out of range

I've narrowed it down to a particular PDF that causes this issue, which you can download at https://discovery.ucl.ac.uk/id/eprint/10072810/1/042418eqs101o.pdf (it is an open access journal article). The final pages include some strangely-sized pages that consist only of figures; maybe this is causing the problem?

If I load the PDF on its own using spacy_layout, I don't get this error (i.e. this works):

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
doc = layout("042418eqs101o.pdf")

I'm using:

  • spaCy 3.7.5
  • prodigy 1.17.2
  • spacy_layout 0.0.9
  • prodigy_pdf 0.4.0
  • docling 2.13.0

On an Apple M3 Pro, macOS 14.7.1.

The cause does seem to be some of the final pages, which have no spans available. With an empty page_spans list, page_spans[0] in get_full_stream raises the IndexError:

>>> for i, (page_layout, page_spans) in enumerate(doc._.get(layout.attrs.doc_pages)):
...     print(i, page_spans)
...
26 [TABLE]
27 [TABLE]
28 [TABLE]
29 [TABLE]
30 [TABLE]
31 []
32 []
33 []
34 [1, 1, 1, b)]
35 []
36 []
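
For illustration, here is a minimal sketch of the kind of guard that would avoid the error, assuming the fix is simply to skip pages whose span list is empty. It does not use prodigy_pdf or spacy_layout: SpanStub and page_bounds are hypothetical names made up for this example, and the toy data only mimics the shape of the output above. I have not checked how get_full_stream should actually treat such pages.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanStub:
    """Hypothetical stand-in for a spaCy Span; only .start and .end are used."""
    start: int
    end: int


def page_bounds(page_spans: List[SpanStub]) -> Optional[Tuple[int, int]]:
    """Return (start, end) token offsets for a page, or None if it has no spans."""
    if not page_spans:
        # An empty list is what makes page_spans[0] raise IndexError.
        return None
    return page_spans[0].start, page_spans[-1].end


# Toy data: the middle page has no spans, like pages 31-33 above.
pages = [[SpanStub(0, 5), SpanStub(5, 9)], [], [SpanStub(9, 12)]]
bounds = [page_bounds(spans) for spans in pages]
non_empty = [b for b in bounds if b is not None]
```

With a check like this, pages without spans would be skipped rather than crashing the whole stream; whether they should instead be shown as image-only pages is a separate question.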