Hello,
I'm trying to run the pdf.layout.fetch recipe to process a directory of PDFs, but during the run I see:
Traceback (most recent call last):
[...omitted...]
136, in get_stream
yield from self.get_full_stream()
File "/Users/budak/.pyenv/versions/ezdeposit/lib/python3.12/site-packages/prodigy_pdf/spans.py", line 153, in get_full_stream
doc[page_spans[0].start : page_spans[-1].end],
~~~~~~~~~~^^^
IndexError: list index out of range
I've narrowed it down to a particular PDF that causes this issue, which you can download at https://discovery.ucl.ac.uk/id/eprint/10072810/1/042418eqs101o.pdf (it is an open access journal article). The final pages are strangely sized and consist only of figures; maybe this is causing the problem?
If I load the PDF on its own using spacy_layout, I don't get this error (i.e. this works):
import spacy
from spacy_layout import spaCyLayout

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
doc = layout("042418eqs101o.pdf")
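To test the guess about the figure-only pages, a small helper along these lines could list the pages that end up with no layout spans, which is the case where indexing `page_spans[0]` would raise the IndexError above. This is only a sketch: `empty_page_indices` is a hypothetical helper, the sample data is made up, and it assumes that `doc._.pages` yields `(page_layout, spans)` pairs as described in the spacy-layout README. Whether prodigy_pdf builds `page_spans` the same way is not confirmed here.

```python
from typing import Any, Iterable, Sequence


def empty_page_indices(pages: Iterable[tuple[Any, Sequence[Any]]]) -> list[int]:
    """Return the 0-based indices of pages whose span list is empty.

    Hypothetical helper: `pages` is assumed to be an iterable of
    (page_layout, spans) pairs, e.g. what doc._.pages is documented to return.
    """
    return [i for i, (_page, spans) in enumerate(pages) if len(spans) == 0]


# Stand-in data (not taken from the real PDF): two pages with spans
# and one figure-only page with none.
sample_pages = [
    ("page 1", ["span a", "span b"]),
    ("page 2", []),
    ("page 3", ["span c"]),
]
empty = empty_page_indices(sample_pages)
print(empty)

# With the real document this would be called as (assumption, see above):
# empty_page_indices(doc._.pages)
```

If this reports any pages for the PDF linked above, that would support the idea that the recipe fails on pages without any spans.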
I'm using:
- spaCy 3.7.5
- prodigy 1.17.2
- spacy_layout 0.0.9
- prodigy_pdf 0.4.0
- docling 2.13.0
On an Apple M3 Pro, macOS 14.7.1.