I believe a question like this has been posted before, but I'm looking for some further clarification.
I'm trying to extract certain information from different invoice templates (all in PDF), such as invoice no, account no, date, total charges etc. Some of this data is in a table and some of it is just native text.
Below are 2 screenshots from 2 different PDFs, as an example where I'd want to take the total amount due. However, I also realised that some fields I wish to extract can have different naming conventions across invoices: instead of invoice no, it can be bill no. Hence I'm not too sure if regexp will be of much help.
I'm not sure if I should just use rule-based matching or train a custom statistical NER model.
If I do the former approach, should I create a key-value pair dictionary?
And if I do the latter approach, I have a very limited dataset, so I'm really not sure how I can go about training my model.
Any help or guidance would be greatly appreciated.
Hi! If you can use a rule-based system (even just regular expressions), that's definitely an approach I would try first, as it's easy to benchmark and the outcome is much more predictable and less experimental/arbitrary. Even if you end up experimenting with machine learning approaches, you typically want to know what baseline accuracy you need to beat.
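To make the benchmarking point concrete, here is a minimal sketch of measuring a baseline. The `extract_total` function, the `accuracy` helper and the three labelled strings are all made up for illustration; in practice you'd use text extracted from your own PDFs with the correct values filled in by hand.

```python
import re

def extract_total(text):
    """Deliberately simple baseline: the first amount after the words 'amount due'."""
    match = re.search(r"amount due\D*([\d,]+\.\d{2})", text, re.IGNORECASE)
    return match.group(1) if match else None

def accuracy(extractor, labelled_examples):
    """Fraction of examples where the extractor returns exactly the expected value."""
    correct = sum(1 for text, expected in labelled_examples if extractor(text) == expected)
    return correct / len(labelled_examples)

# Hypothetical hand-labelled examples: (invoice text, expected total).
labelled = [
    ("Total Amount Due $1,234.50", "1,234.50"),
    ("Amount due: 310.00", "310.00"),
    ("Balance payable 99.95", "99.95"),  # wording the baseline does not cover
]

baseline = accuracy(extract_total, labelled)
print(f"Baseline accuracy: {baseline:.2f}")
```

Whatever you try afterwards (more rules or a trained model) can be scored with the same function on the same examples, so you can see whether it actually beats the simple rules.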
I think regular expressions might work better here than spaCy's rule-based matcher, since you're mostly dealing with arbitrary tabular data and less with natural language text where you can take advantage of context, word types and so on.
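As a rough sketch of how regular expressions can still cope with different naming conventions (e.g. "invoice no" vs. "bill no"), you could keep a dictionary of label aliases per field and build one pattern from it. The alias lists, value patterns and sample strings below are assumptions for illustration, not taken from your invoices.

```python
import re

# Hypothetical label aliases per field; extend these from the invoices you actually see.
FIELD_ALIASES = {
    "invoice_no": ["invoice no", "invoice number", "bill no", "bill number"],
    "total_due": ["total amount due", "amount due", "total charges"],
}

# Assumed shape of each field's value.
VALUE_PATTERNS = {
    "invoice_no": r"[A-Z0-9][A-Z0-9\-/]*",
    "total_due": r"[$€£]?\s?\d[\d,]*(?:\.\d{2})?",
}

def build_pattern(field):
    # Try longer aliases first so "total amount due" wins over "amount due".
    aliases = sorted(FIELD_ALIASES[field], key=len, reverse=True)
    # Allow any run of whitespace between the words of a label.
    labels = "|".join(
        r"\s+".join(re.escape(word) for word in alias.split()) for alias in aliases
    )
    # Label, optional "." and ":" or "#", then the captured value.
    return re.compile(
        rf"(?:{labels})\b\.?\s*[:#]?\s*({VALUE_PATTERNS[field]})", re.IGNORECASE
    )

PATTERNS = {field: build_pattern(field) for field in FIELD_ALIASES}

def extract_fields(text):
    """Return the first match per field, or None if no known label is found."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

invoice_a = "Invoice No: INV-1023\nAccount No: 5567\nTotal Amount Due   $1,234.50"
invoice_b = "Bill No. 88421\nAmount due: 310.00"
print(extract_fields(invoice_a))
print(extract_fields(invoice_b))
```

Supporting a new template then mostly means adding an alias to the dictionary rather than writing a new expression, and a field that comes back as `None` tells you which label wording you haven't covered yet.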
From what you describe, it doesn't sound like NER would be a good fit at all. Named entity recognition typically involves extracting names and real-world objects (e.g. person names or monetary amounts) in natural language, based on the surrounding context. That's what an NER model implementation is going to be optimised for. There's not that much context here, not much actual text and many of the clues aren't actually in the raw text.
One approach that I've seen used in similar scenarios is completely reframing the task and treating it as a computer vision problem first: you train a model to predict bounding boxes for the total amounts. This lets you take more of the visual clues into account. And once you have the bounding box, you can use OCR to extract the text from it. I think if you search online for approaches for computer vision for invoice or resume parsing, you might find some papers that explain the idea. Could be worth a try and sounds more promising to me than a pure NLP approach.
What if I'm working with native text rather than tabular data? Would spaCy be of more help?
And if that were the case, any tips on how I can move forward with creating a pattern dictionary that uses regex/rule-based/phrase matching to train a custom NER model for key-value pairs? I actually got this idea from Invoice Parsing using Spacy.