Hi, I came across a book archive with hundreds of entries.
Each entry is a book, with its author, title (in Latin or ancient Italian), volume number (if applicable), book format, publication year, place of publication, occasional comments, and an archive identifier.
Ultimately, I would like to extract all these fields with the least possible manual effort.
The data is "semi-structured":
- each book appears to be preceded by an "=" sign;
- different fields have distinctive features (e.g. authors and places of publication are upper case, and formats come in a small set of "categorical" variants; see the parsing sketch after this list);
- some fields may be missing.
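To make the rule-based option concrete, here is a minimal sketch of the kind of parser I have in mind. The regexes, the format list, and the sample entry are all invented for illustration, not rules tested against the real archive:

```python
import re

# Hypothetical list of categorical format variants.
FORMATS = r"(?:4to|8vo|12mo|16mo|fol\.)"

# Guessed entry layout: "=" marker, upper-case author, free-text title,
# optional volume, categorical format, upper-case place, 4-digit year.
ENTRY_RE = re.compile(
    r"=\s*(?P<author>[A-Z][A-Z\s.']+?),"   # upper-case author
    r"\s*(?P<title>.+?),"                  # title, matched lazily up to a comma
    r"(?:\s*vol\.\s*(?P<volume>\d+),)?"    # optional volume number
    r"\s*(?P<format>" + FORMATS + r"),"    # categorical format
    r"\s*(?P<place>[A-Z][A-Z\s]*?)\s+"     # upper-case place of publication
    r"(?P<year>\d{4})",                    # publication year
    re.DOTALL,
)

def parse_entries(text: str) -> list[dict]:
    """One row per matched entry; non-matching entries are dropped here,
    but in practice I would route them to manual review."""
    return [m.groupdict() for m in ENTRY_RE.finditer(text)]

# Invented sample entry, only to show the expected shape of the output.
sample = "= PETRARCA F., Canzoniere, vol. 2, 8vo, VENEZIA 1544"
print(parse_entries(sample))
```

Whatever fails to match would go to a manual-review pile, which could double as labelled data for the statistical route.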
It seems reasonable to expect that OCR on the PDFs will let me extract (noisy) text.
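For the OCR step, this is roughly what I was planning, assuming Tesseract with the Italian and Latin traineddata installed (pdf2image also needs the poppler utilities); the file name is made up:

```python
import pytesseract                        # wrapper around the tesseract binary
from pdf2image import convert_from_path   # requires poppler to be installed

def ocr_pdf(path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image and OCR it.
    lang='ita+lat' assumes the ita and lat traineddata are available."""
    pages = convert_from_path(path, dpi=dpi)
    return "\n".join(
        pytesseract.image_to_string(page, lang="ita+lat") for page in pages
    )

text = ocr_pdf("archive.pdf")  # hypothetical file name
```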
I still haven't decided whether I should then use a rule-based approach or a statistical one to convert the OCR output into a structured table.
- Do you have any tips on a sensible strategy to tackle this problem?
- Do you expect NER to work with a reasonably low number of examples (a couple of hundred, possibly bootstrapped from rules, as in the sketch below) in this context?
- Since the language is ancient Italian/Latin, are there any specific spaCy language models I should consider?
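To clarify what I mean by bootstrapping from rules in the second question: something like running a rule-based EntityRuler on a blank Italian pipeline to produce noisy "silver" annotations, correcting a sample by hand, and training the statistical NER on that. The labels and patterns below are invented for illustration:

```python
import spacy

# No pretrained pipeline covers ancient Italian or Latin, so start from
# a blank Italian pipeline (tokenizer only) and add rules on top.
nlp = spacy.blank("it")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    # hypothetical categorical format variants
    {"label": "FORMAT", "pattern": [{"LOWER": {"IN": ["4to", "8vo", "12mo", "16mo"]}}]},
    # any 4-digit token as a candidate publication year
    {"label": "YEAR", "pattern": [{"SHAPE": "dddd"}]},
    # runs of upper-case tokens: this conflates authors and places, so the
    # resulting silver annotations would need a pass of manual correction
    {"label": "UPPER_FIELD", "pattern": [{"IS_UPPER": True, "OP": "+"}]},
])

doc = nlp("= PETRARCA F., Canzoniere, vol. 2, 8vo, VENEZIA 1544")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The idea would be to hand-correct the ruler's output on a sample of entries and use that as training data, which is why I am wondering whether a couple of hundred corrected entries would be enough.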