I believe a question like this has been posted before; however, I'm just looking for further clarification.
I'm trying to extract certain information from different invoice templates (all in PDF), such as invoice no., account no., date, total charges, etc. Some of this data is in a table and some is just native text.
Below are 2 screenshots from 2 different PDFs, as an example where I'd want to take the total amount due. However, I also realised that some fields I wish to extract can be named differently across invoices. For example, "invoice no" can appear as "bill no" instead, so I'm not too sure regexp will be of much help.
I'm not sure if I should just use rule-based matching or train a custom statistical NER model.
If I take the former approach, should I create a key-value pair dictionary?
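To make the dictionary idea concrete, here is a minimal sketch of what it could look like, assuming the PDF text has already been extracted into a plain string. Each canonical field maps to the label variants it may appear under, and one regular expression per field is built from those variants. The field names, aliases, and sample text are all hypothetical, and real invoices would likely need more aliases and layout handling than this.

```python
import re

# Hypothetical mapping: canonical field name -> label variants seen on invoices.
FIELD_ALIASES = {
    "invoice_no": ["invoice no", "invoice number", "bill no", "bill number"],
    "account_no": ["account no", "account number"],
    "total_due": ["total amount due", "total charges", "amount due"],
}


def build_pattern(aliases):
    # Longest alias first, so "invoice number" is tried before its prefix "invoice no".
    alternatives = "|".join(
        re.escape(alias) for alias in sorted(aliases, key=len, reverse=True)
    )
    # Label, optional "." / ":" / "#" separators, then the value up to end of line.
    return re.compile(rf"(?:{alternatives})\s*[.:#]*\s*(\S.*)", re.IGNORECASE)


PATTERNS = {field: build_pattern(aliases) for field, aliases in FIELD_ALIASES.items()}


def extract_fields(text):
    """Return {canonical_field: value} for every field whose label is found."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1).strip()
    return found


# Made-up text standing in for what a PDF text extractor might return.
sample = """Bill No.: INV-00123
Account Number: 998877
Total Amount Due: $1,250.00"""

print(extract_fields(sample))
```

With this layout, supporting a new template would mostly mean adding aliases to the dictionary rather than writing a new regexp per invoice, though values that sit in table cells away from their labels would still need separate handling.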
And if I take the latter approach, I have a very limited dataset, so I'm really not too sure how to go about training my model.
Any help or guidance would be greatly appreciated.