Will NER work to extract structured data from semi-structured OCR'd PDFs?

Hi, I came across a book archive with hundreds of entries like this:

Each entry is a book, with its author, title (in Latin or ancient Italian), volume number (if applicable), book format, publication year, place of publication, occasional comments, and an archive identifier.
Ultimately, I would like to extract all these fields with the least possible manual effort.

Data is "semi-structured":

  • each book appears to be preceded by an "=" sign;
  • different fields have distinctive features (e.g. authors and places of publication are upper case), and formats come in "categorical" variants;
  • some fields may be missing.

It seems reasonable that running OCR on the PDFs should yield (noisy) plain text.
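For the OCR step, I was imagining something along these lines. This is just a sketch: it assumes the PDFs are scanned images, that pdf2image and pytesseract are installed along with Tesseract's Italian ("ita") and Latin ("lat") language data, and the filename is made up:

```python
# Sketch: OCR a scanned PDF and split the noisy text into candidate entries.
# Assumes: pdf2image + pytesseract installed, Tesseract "ita" and "lat"
# language packs available. "catalogue.pdf" is a hypothetical filename.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("catalogue.pdf", dpi=300)
text = "\n".join(
    pytesseract.image_to_string(page, lang="ita+lat") for page in pages
)

# Each entry is reportedly preceded by "=", so split on that delimiter.
entries = [e.strip() for e in text.split("=") if e.strip()]
print(f"Found {len(entries)} candidate entries")
```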

I still haven't made up my mind about whether I should then use a rules-based approach or a statistical one to convert the OCR output into a structured table.

  • Do you have any tips about a sensible strategy to tackle this problem?
  • Do you expect NER to work with a reasonably low number of examples (a couple hundred, possibly bootstrapped from rules) in this context?
  • Since the language is ancient Italian/Latin, are there any specific spaCy language models I should consider using?

It's pretty hard to give much advice on this, as each project is different and I haven't personally worked on a very similar problem. You might find NER works reasonably well with little annotation, or you might find that regular expressions are more effective; I'm not really sure.
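To make the rules-based route concrete, here's a rough sketch of what a regex over one entry could look like. The entry format, field order, and the "in-8"-style format tokens are guesses based on your description, not real data, so treat this as a starting point to adapt:

```python
# Sketch: rules-based extraction of one catalogue entry. The pattern and
# the sample entry are hypothetical, inferred from the description above.
import re

ENTRY_RE = re.compile(
    r"^(?P<author>[A-Z][A-Z\s\.]+),\s*"    # upper-case author
    r"(?P<title>[^,]+?),\s*"               # title, up to the next comma
    r"(?:vol\.\s*(?P<volume>\d+),\s*)?"    # optional volume number
    r"(?P<format>in-\d+|in-fol\.?),\s*"    # categorical book format
    r"(?P<place>[A-Z]+)\s+"                # upper-case place of publication
    r"(?P<year>\d{4})"                     # publication year
)

def parse_entry(entry: str):
    match = ENTRY_RE.search(entry)
    return match.groupdict() if match else None

print(parse_entry("BERNARDUS, De consideratione, vol. 2, in-8, ROMA 1596"))
```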

I would suggest starting out by annotating evaluation data, which will be useful to you in either approach. It will also give you a better feel for the data so you know what to expect. Unfortunately I don't think any pretrained spaCy models will be helpful to you.
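If you do go the NER route, a minimal sketch of packaging your annotated (or rule-bootstrapped, then hand-corrected) entries as spaCy data might look like this. The entity labels, example text, and character offsets are made up, and it starts from a blank Italian pipeline, since no pretrained model covers early Italian or Latin:

```python
# Sketch: turn annotated entries into a spaCy DocBin for training or
# evaluation. Labels, text, and offsets below are hypothetical examples.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("it")  # blank pipeline; no suitable pretrained model exists

# (text, [(start_char, end_char, label), ...]) pairs, annotated by hand
# or bootstrapped from the regex rules and then corrected manually.
annotations = [
    ("BERNARDUS, De consideratione, ROMA 1596",
     [(0, 9, "AUTHOR"), (11, 28, "TITLE"), (30, 34, "PLACE"), (35, 39, "YEAR")]),
]

db = DocBin()
for text, ents in annotations:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(s, e, label=label) for s, e, label in ents]
    doc.ents = [s for s in spans if s is not None]  # skip misaligned spans
    db.add(doc)

db.to_disk("dev.spacy")  # usable as an evaluation set for `spacy train`
```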
