Bug: IndexError from pdf.layout.fetch recipe

Hello,

I'm trying to run the pdf.layout.fetch recipe to process a directory of PDFs, but during the run I see:

Traceback (most recent call last):
[...omitted...]
136, in get_stream
    yield from self.get_full_stream()
  File "/Users/budak/.pyenv/versions/ezdeposit/lib/python3.12/site-packages/prodigy_pdf/spans.py", line 153, in get_full_stream
    doc[page_spans[0].start : page_spans[-1].end],
        ~~~~~~~~~~^^^
IndexError: list index out of range

I've narrowed it down to a particular PDF that causes this issue, which you can download at https://discovery.ucl.ac.uk/id/eprint/10072810/1/042418eqs101o.pdf (it is an open access journal article). The final pages include some strangely-sized pages that consist only of figures; maybe this is causing the problem?

If I load the PDF on its own using spacy_layout, I don't get this error (i.e. this works):

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
doc = layout("042418eqs101o.pdf")

I'm using:

  • spaCy 3.7.5
  • prodigy 1.17.2
  • spacy_layout 0.0.9
  • prodigy_pdf 0.4.0
  • docling 2.13.0

This is on an Apple M3 Pro running macOS 14.7.1.

It does seem like the cause is some of the ending pages, which have no spans available:

>>> for i, (page_layout, page_spans) in enumerate(doc._.get(layout.attrs.doc_pages)):
...     print(i, page_spans)
...
26 [TABLE]
27 [TABLE]
28 [TABLE]
29 [TABLE]
30 [TABLE]
31 []
32 []
33 []
34 [1, 1, 1, b)]
35 []
36 []

Welcome to the forum @budak :wave:

Thank you for the detailed bug report!
The root cause of the IndexError you observed was that, indeed, the recipe assumed there would always be some spans in documents output by docling.
We have just shipped patches to both prodigy-pdf and spacy-layout.
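
For anyone reading along, here is a minimal sketch of the kind of guard that avoids this class of error. It is not the actual patch: `FakeSpan` and `page_token_range` are hypothetical names made up for illustration, and `FakeSpan` only stands in for a spaCy span with `start` and `end` token offsets.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span (only start/end offsets)."""
    start: int
    end: int


def page_token_range(page_spans: List[FakeSpan]) -> Optional[Tuple[int, int]]:
    """Return the (start, end) token range covered by a page's spans.

    Returns None for a page with no spans instead of indexing into an
    empty list, which is what raises IndexError.
    """
    if not page_spans:
        return None
    return (page_spans[0].start, page_spans[-1].end)


# Three pages; the middle one has no spans, like pages 31-33 in the report.
pages = [[FakeSpan(0, 5), FakeSpan(5, 9)], [], [FakeSpan(9, 12)]]
ranges = [page_token_range(spans) for spans in pages]
```

With a check like this, a caller can skip pages where the result is `None` rather than slicing the doc with `page_spans[0]` and `page_spans[-1]`.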
Could you please upgrade prodigy-pdf to v0.4.2 and spacy-layout to v0.0.11 and see if that solves your issue?
Thank you.

Hello,

I've upgraded to those versions and the PDF processes without erroring now. Thanks so much!

Glad to hear that! Thanks for reporting back!