Parsing HTML Page


I’m still working on that PDF information extraction problem. I’m still concerned that I’m losing too much of the data when I simply convert to text, so I’ve begun converting the PDFs to HTML documents in the hope that this will retain a lot of the structure (font, location on the page, relation to other page elements).

I know I can create a custom loader / recipe for rendering this with Prodigy. If, for example, I’m creating a NER model, can it take some of the font / location attributes into consideration? One of the (many) problems I’m facing with these documents is that sometimes the key/value pairs will look like:

Key: Value
Key: Value
Key: Value

And other times it will look like:
Key Key Key
value value value

The only way to tell that one value is a key is usually by the bold font.

Any suggestions on how to best have this relationship recognized?
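To make the bold-based layout concrete, here is a rough sketch (not Prodigy-specific, and the class/function names are just illustrative) using only the standard library’s html.parser: it tags each text chunk with whether it appeared inside a bold element, then pairs each bold chunk (key) with the following non-bold chunk (value). The column-style layout above would need position-aware pairing on top of this.

```python
# Sketch: recover key/value pairs from HTML where keys are marked by bold font.
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}

class BoldTextExtractor(HTMLParser):
    """Collect (text, is_bold) chunks from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # how many bold elements we're currently inside
        self.chunks = []  # list of (text, is_bold) tuples

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((text, self.depth > 0))

def pair_keys_and_values(chunks):
    """Pair each bold chunk (a key) with the next non-bold chunk (its value)."""
    pairs, pending_key = [], None
    for text, is_bold in chunks:
        if is_bold:
            pending_key = text
        elif pending_key is not None:
            pairs.append((pending_key, text))
            pending_key = None
    return pairs

parser = BoldTextExtractor()
parser.feed("<p><b>Name:</b> Alice</p><p><b>Role:</b> Engineer</p>")
print(pair_keys_and_values(parser.chunks))
# [('Name:', 'Alice'), ('Role:', 'Engineer')]
```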

Edit: I’ve been reading up on custom attributes and properties… if I set one of the properties to the element’s font, would that be considered by the model? Or is it just a way to keep the doc as the main source of truth?
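For context, the kind of custom attribute being asked about looks roughly like this (the attribute name "font" is just an example); note that extension attributes only store data on the doc/token, they are not used as model features unless wired in separately:

```python
import spacy
from spacy.tokens import Token

# Register a custom attribute. By default this is only stored on the
# token -- it isn't consumed as a feature by the statistical models.
Token.set_extension("font", default=None)

nlp = spacy.blank("en")  # any pipeline works for this sketch
doc = nlp("Name: Alice")

# e.g. copied over from the HTML source during loading
doc[0]._.font = "bold"

print([(t.text, t._.font) for t in doc])
```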

A quick and dirty solution so you can experiment: you could use one of the prefix, suffix or shape lexical attributes. These are used as features to construct the lexical representations, but the calculation of these features can be customised:

import spacy
from spacy.attrs import PREFIX

nlp = spacy.blank("en")  # or your loaded pipeline

# Set the function that will return the string feature
nlp.vocab.lex_attr_getters[PREFIX] = lambda string: ''.join(reversed(string))

# Update the existing values (they're cached per type in the vocab)
for lex in nlp.vocab:
    lex.prefix_ = nlp.vocab.lex_attr_getters[PREFIX](lex.orth_)

You can read more about how this all works here:

My meta advice would be not to worry about adding these features for now. You can always fine-tune these things later, once the model is set up and you have an accuracy figure to improve.