I’m still working on that PDF information extraction problem. I’m concerned that converting straight to plain text loses too much information, so I’ve started converting the PDFs to HTML documents instead, in the hope that this retains more of the structure (font, location on the page, relation to other page elements).
I know I can create a custom loader / recipe for rendering this with Prodigy. If, for example, I’m creating an NER model, can it take some of the font / location attributes into consideration? One of the (many) problems I’m facing with these documents is that sometimes the key/value pairs will look like:
And other times it will look like:
Key Key Key
value value value
In that second layout, the only way to tell that a piece of text is a key rather than a value is usually the bold font.
Any suggestions on how to best have this relationship recognized?
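To make the bold-font point concrete, here is a rough sketch of how I imagine pulling a per-word "is bold" flag out of the converted HTML using only the standard library. It assumes the converter marks bold text with `<b>` or `<strong>` tags, which may not hold (some converters use CSS classes instead), and the `sample` string and the `bold_tokens` helper are made up for illustration.

```python
from html.parser import HTMLParser


class BoldTokenParser(HTMLParser):
    """Collect (word, is_bold) pairs from a snippet of converted HTML."""

    def __init__(self):
        super().__init__()
        self.bold_depth = 0  # how many <b>/<strong> tags are currently open
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        # Split on whitespace; every word inherits the current bold state.
        for word in data.split():
            self.tokens.append((word, self.bold_depth > 0))


def bold_tokens(html):
    """Hypothetical helper: return a list of (word, is_bold) tuples."""
    parser = BoldTokenParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Made-up sample mimicking the "keys on one row, values on the next" layout.
sample = "<p><b>Name</b> <b>Date</b></p><p>Alice 2019-01-01</p>"
tokens = bold_tokens(sample)
```

With flags like these, the header row could at least be separated from the value row before any model sees the text.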
Edit: I’ve been reading up on custom attributes and properties… if I set some of the properties as the element font, would that be considered? Or is it just a way to keep the doc as the main source of truth?
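This is roughly what I mean by setting the font as a custom attribute: a minimal sketch using spaCy's `Token.set_extension`. The attribute name `is_bold` and the `bold_words` set are placeholders I made up; in practice the flags would come from the HTML. As far as I can tell this only stores the value on the token, and whether the NER model ever looks at it is exactly what I'm unsure about.

```python
import spacy
from spacy.tokens import Token

# Register a custom token attribute; "is_bold" is a made-up name.
# force=True avoids an error if the extension was already registered.
Token.set_extension("is_bold", default=False, force=True)

nlp = spacy.blank("en")  # blank pipeline, tokenizer only
doc = nlp("Name Date Alice Monday")

# Placeholder: pretend these words were bold in the source document.
bold_words = {"Name", "Date"}
for token in doc:
    token._.is_bold = token.text in bold_words

# Tokens flagged as bold, i.e. the candidate keys.
keys = [token.text for token in doc if token._.is_bold]
```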