Will NER work to extract structured data from semi-structured OCR'd PDFs?

Hi, I came across a book archive with hundreds of entries like this:

Each entry is a book, with its author, title (in Latin or ancient Italian), volume number (if applicable), book format, publication year, place of publication, occasional comments, and an archive identifier.
Ultimately, I would like to extract all these fields with the least possible manual effort.

Data is "semi-structured":

  • each book appears to be preceded by an "=" sign;
  • different fields have distinctive features (e.g. authors and places of publication are upper case), and formats come in "categorical" variants;
  • some fields may be missing.

It seems reasonable that running OCR on the PDFs should yield (noisy) plain text.
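For the OCR step, I was imagining something along these lines. This is just a sketch: it assumes the PDFs are scanned images, that pdf2image and pytesseract are installed along with Tesseract's Italian ("ita") and Latin ("lat") language data, and the filename is made up:

```python
# Sketch: OCR a scanned PDF and split the noisy text into candidate entries.
# Assumes: pdf2image + pytesseract installed, Tesseract "ita" and "lat"
# language packs available. "catalogue.pdf" is a hypothetical filename.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("catalogue.pdf", dpi=300)
text = "\n".join(
    pytesseract.image_to_string(page, lang="ita+lat") for page in pages
)

# Each entry is reportedly preceded by "=", so split on that delimiter.
entries = [e.strip() for e in text.split("=") if e.strip()]
print(f"Found {len(entries)} candidate entries")
```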

I still haven't made up my mind about whether I should then use a rules-based approach or a statistical one to convert the OCR output into a structured table.

  • Do you have any tips about a sensible strategy to tackle this problem?
  • Do you expect NER to work with a reasonably low number of examples (a couple hundred, possibly bootstrapped from rules) in this context?
  • Since the language is ancient Italian/Latin, are there any specific spaCy language models I should consider using?

It's pretty hard to give much advice on this, as each project is different and I haven't personally worked on a very similar problem. You might find NER works reasonably well with little annotation, or you might find that regular expressions are more effective; I'm not really sure.
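To make the rules-based route concrete, here's a rough sketch of what a regex over one entry could look like. The entry format, field order, and the "in-8"-style format tokens are guesses based on your description, not real data, so treat this as a starting point to adapt:

```python
# Sketch: rules-based extraction of one catalogue entry. The pattern and
# the sample entry are hypothetical, inferred from the description above.
import re

ENTRY_RE = re.compile(
    r"^(?P<author>[A-Z][A-Z\s\.]+),\s*"    # upper-case author
    r"(?P<title>[^,]+?),\s*"               # title, up to the next comma
    r"(?:vol\.\s*(?P<volume>\d+),\s*)?"    # optional volume number
    r"(?P<format>in-\d+|in-fol\.?),\s*"    # categorical book format
    r"(?P<place>[A-Z]+)\s+"                # upper-case place of publication
    r"(?P<year>\d{4})"                     # publication year
)

def parse_entry(entry: str):
    match = ENTRY_RE.search(entry)
    return match.groupdict() if match else None

print(parse_entry("BERNARDUS, De consideratione, vol. 2, in-8, ROMA 1596"))
```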

I would suggest starting out by annotating evaluation data, which will be useful to you in either approach. It will also give you a better feel for the data so you know what to expect. Unfortunately I don't think any pretrained spaCy models will be helpful to you.
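If you do go the NER route, a minimal sketch of packaging your annotated (or rule-bootstrapped, then hand-corrected) entries as spaCy data might look like this. The entity labels, example text, and character offsets are made up, and it starts from a blank Italian pipeline, since no pretrained model covers early Italian or Latin:

```python
# Sketch: turn annotated entries into a spaCy DocBin for training or
# evaluation. Labels, text, and offsets below are hypothetical examples.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("it")  # blank pipeline; no suitable pretrained model exists

# (text, [(start_char, end_char, label), ...]) pairs, annotated by hand
# or bootstrapped from the regex rules and then corrected manually.
annotations = [
    ("BERNARDUS, De consideratione, ROMA 1596",
     [(0, 9, "AUTHOR"), (11, 28, "TITLE"), (30, 34, "PLACE"), (35, 39, "YEAR")]),
]

db = DocBin()
for text, ents in annotations:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(s, e, label=label) for s, e, label in ents]
    doc.ents = [s for s in spans if s is not None]  # skip misaligned spans
    db.add(doc)

db.to_disk("dev.spacy")  # usable as an evaluation set for `spacy train`
```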
