Parsing HTML Page

KevinJ90825 · February 26, 2018, 9:25pm

Hi!

I’m still working on that PDF information extraction problem. I’m still concerned that I’m losing too much of the data as I simply convert to text, so I’ve begun working on converting the PDFs to HTML documents with the hope that this will retain a lot of the structure (font, location on page, relation to other page elements).

I know I can create a custom loader / recipe for rendering this with Prodigy. If, for example, I’m creating an NER model, can it take some of the font / location attributes into consideration? One of the (many) problems I’m facing with these documents is that sometimes the key/value pairs will look like:

Key: Value
Key: Value
Key: Value

And other times it will look like:

Key Key Key
Value Value Value

Usually the only way to tell that a piece of text is a key rather than a value is its bold font.
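To make the bold-font cue concrete, here is a minimal sketch of how it might be pulled out of the converted HTML using only Python's standard library. It assumes the PDF-to-HTML converter marks bold text with `<b>` or `<strong>` tags. Many converters use CSS classes or inline styles instead, so the tag check would need adapting. The names `BoldRunExtractor` and `extract_runs` are hypothetical and not part of Prodigy or spaCy.

```python
from html.parser import HTMLParser


class BoldRunExtractor(HTMLParser):
    """Collect (text, is_bold) pairs from an HTML fragment."""

    # Assumption: the converter marks bold text with these tags.
    BOLD_TAGS = {"b", "strong"}

    def __init__(self):
        super().__init__()
        self.bold_depth = 0  # number of bold tags currently open
        self.runs = []       # collected (text, is_bold) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BOLD_TAGS:
            self.bold_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BOLD_TAGS and self.bold_depth > 0:
            self.bold_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.runs.append((text, self.bold_depth > 0))


def extract_runs(html):
    """Return the text runs of an HTML string, each flagged as bold or not."""
    parser = BoldRunExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.runs


# Both layouts from the post yield the same kind of flagged runs.
inline = extract_runs("<p><b>Name:</b> Alice</p>")
table = extract_runs(
    "<tr><td><b>Name</b></td><td><b>Age</b></td></tr>"
    "<tr><td>Alice</td><td>30</td></tr>"
)
```

A custom loader could then carry the flag along with each piece of text, so the bold information is not lost when the text is handed on for annotation.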

Any suggestions on how to best have this relationship recognized?

Edit: I’ve been reading up on custom attributes and properties… if I set some of the properties as the element font, would that be considered? Or is it just a way to keep the doc as the main source of truth?

honnibal · February 27, 2018, 12:21am

A quick and dirty solution so you can experiment: you could use one of the prefix, suffix or shape lexical attributes. These are used as features to construct the lexical representations, but the calculation of these features can be customised:

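import spacy
from spacy.attrs import PREFIX

# Assumption for illustration: a blank English pipeline; use your own loaded nlp object instead
nlp = spacy.blank("en")

# Set the function that will return the string feature
# (reversing the string is only a placeholder for your own feature)
nlp.vocab.lex_attr_getters[PREFIX] = lambda string: ''.join(reversed(string))

# Set the existing values (they're cached per type in the vocab)
for lex in nlp.vocab:
    lex.prefix_ = nlp.vocab.lex_attr_getters[PREFIX](lex.orth_)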

You can read more about how this all works here: https://spacy.io/usage/spacy-101#vocab

My meta advice would be to not worry about adding these features for now. You can always try to fine-tune these things later, once the model is set up and you have an accuracy figure to improve.
