Bug description:
When I run a PDF through a Prodigy PDF module (e.g. pdf.layout.fetch), the system starts processing the PDF but ultimately errors out with the error below.
From searching online, I found that several users have received similar errors (outside of Prodigy), and a common solution is to downgrade to transformers==4.38.2.
However, doing this leads to a dependency conflict as docling-ibm-models 3.1.0 requires transformers<5.0.0,>=4.42.0.
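To make the conflict concrete, here is a minimal sketch that checks a version against the range quoted above. The helper names (parse_version, satisfies) are made up for illustration, and the parsing only handles plain numeric versions like 4.38.2; pip's real resolver is more thorough.

```python
def parse_version(text):
    """Turn '4.38.2' into (4, 38, 2). Plain numeric releases only (assumption)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Return True if lower_inclusive <= version < upper_exclusive."""
    parsed = parse_version(version)
    return parse_version(lower_inclusive) <= parsed < parse_version(upper_exclusive)


# docling-ibm-models 3.1.0 requires transformers>=4.42.0,<5.0.0
print(satisfies("4.38.2", "4.42.0", "5.0.0"))  # False: the suggested downgrade is out of range
print(satisfies("4.42.0", "4.42.0", "5.0.0"))  # True: the lowest version that fits
```

So the commonly suggested pin sits below the lower bound, which is why pip reports a conflict.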
Reproduction steps:
Do a clean install of Prodigy 1.17.3 and prodigy-pdf, then run a PDF through it.
Environment variables:
Version 1.17.3
License Type Prodigy Personal
Location D:\Programs\anaconda3\envs\env311_nlp\Lib\site-packages\prodigy
Prodigy Home C:\Users\sinad\.prodigy
Platform Windows-10-10.0.19045-SP0
Python Version 3.11.11
spaCy Version 3.7.5
Database Name SQLite
Database Id sqlite
Total Datasets 3
Total Sessions 25
The issue actually comes from a potential resource conflict between pypdfium2 and docling when they try to access the same PDF file.
It can be fixed by explicitly closing the PDF file once pypdfium2 no longer needs it.
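For anyone curious, the general pattern looks roughly like the sketch below. This is not the plugin's actual code: PdfHandle is a hypothetical stand-in for an object that keeps a PDF file open, and count_pages is an illustrative helper. The point is only that the handle is closed deterministically before anything else opens the same file.

```python
from contextlib import closing


class PdfHandle:
    """Hypothetical stand-in for a library object that holds a PDF file open."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def page_count(self):
        if self.closed:
            raise ValueError("handle already closed")
        return 2  # fixed value, for illustration only

    def close(self):
        self.closed = True


def count_pages(handle):
    """Read what is needed, then release the file even if reading fails."""
    with closing(handle):
        return handle.page_count()


handle = PdfHandle("sample.pdf")  # hypothetical path
print(count_pages(handle))  # 2
print(handle.closed)        # True: safe for a second library to open the file now
```

Relying on garbage collection to release the file instead would leave the timing up to the interpreter, which is the kind of situation where two libraries can end up holding the same file at once.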
Could you check if version 0.4.1 of the plugin solves this issue?
Also, a heads-up that while working on this issue I found a small bug in spacy-layout where the document layout pagination and the span layout pagination might be mismatched. You might also want to watch for an update to spacy-layout (PR).
Unfortunately, after updating to version 0.4.1 the issue (KeyError: 1) still remains. Additionally, I am testing on documents longer than one page to account for the separate potential spacy-layout issue.
Thanks for the quick feedback!
The "Cannot close object, library is destroyed" message is not showing up anymore, right?
The KeyError is related to the spacy-layout issue. If there are layout spans on page 1 of the PDF, they will cause the KeyError because page 1 won't exist in the document layout.
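A toy reproduction of that mismatch, using plain dicts rather than spacy-layout's real data structures (the pages and spans variables, their keys, and pages_missing_layout are all hypothetical):

```python
# Document-level layout keyed by page number; page 1 is missing (hypothetical data).
pages = {2: {"width": 612, "height": 792}}

# Span-level layout entries that each point at a page number (hypothetical data).
spans = [
    {"text": "Title", "page_no": 1},
    {"text": "Body", "page_no": 2},
]


def pages_missing_layout(spans, pages):
    """Return the page numbers that spans refer to but the document layout lacks."""
    return sorted({span["page_no"] for span in spans if span["page_no"] not in pages})


print(pages_missing_layout(spans, pages))  # [1]

try:
    pages[spans[0]["page_no"]]  # look up the layout for the first span's page
except KeyError as err:
    print("KeyError:", err)  # KeyError: 1
```

Any span that sits on page 1 triggers the lookup failure, which matches the KeyError: 1 you are seeing.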
I'll let you know as soon as the patch to spacy-layout is released. Sorry about that!
Thanks for the update! Happy to report that we have already patched spacy-layout.
You should be able to upgrade to v0.0.11 in place to get the fix.
In the meantime, we have also fixed one more outstanding issue in the prodigy-pdf plugin, so we recommend upgrading it to v0.4.2 as well. Thanks again for the report!
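If you want to confirm what is installed after upgrading, something like the sketch below should work. It assumes the distributions are named spacy-layout and prodigy-pdf in your environment and that their versions are plain numeric strings; MINIMUMS, as_tuple and outdated are illustrative names.

```python
from importlib import metadata

# Minimum versions mentioned in this thread; distribution names are an assumption.
MINIMUMS = {"spacy-layout": "0.0.11", "prodigy-pdf": "0.4.2"}


def as_tuple(version_text):
    """Turn '0.4.2' into (0, 4, 2). Plain numeric versions only (assumption)."""
    return tuple(int(part) for part in version_text.split(".")[:3])


def outdated(minimums, lookup=metadata.version):
    """Map each package that is missing or too old to its installed version."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if as_tuple(installed) < as_tuple(minimum):
            report[name] = installed
    return report


print(outdated(MINIMUMS))  # an empty dict means both packages meet the minimums
```

The lookup parameter is only there so the check can be exercised with made-up versions without touching the real environment.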