I’m trying to extract semantic information from invoices. After a lot of searching, I found that spaCy could be very useful for my problem domain. I want to extract 6-8 fields that are commonly found in an invoice. For simplicity, I am using the extraction of the date field as an example.
From the sample invoice templates, I have found these formats for the date field:
Here, we can see that dates are represented in various formats; sometimes there are more than two date fields, and sometimes the date is placed below in a tabular layout.
My approach is this: I am training two NERs for a key-value pair (date_id, date) to identify the date on which the invoice was issued. date_id can take syntactic forms such as “Invoice date”, “Date:”, “Billing”, “This itinerary was generated on”, etc., and date can be in any kind of date format.
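To make the (date_id, date) idea concrete, here is a minimal sketch of a rule-based pass that tags both entity types with character offsets. It is not NER itself, only a baseline that could bootstrap annotations for it. The patterns and the annotate helper are my own assumptions and cover just a few of the possible phrasings and date formats; the output uses the (text, {"entities": [(start, end, label)]}) shape that spaCy NER training examples have commonly used.

```python
import re

# Hypothetical key phrases; extend with whatever your invoices actually use.
DATE_ID_RE = re.compile(
    r"\b(?:invoice date|billing|this itinerary was generated on|date)\b",
    re.IGNORECASE,
)

# Assumption: only a few formats are covered here
# (12/03/2019, 12.03.19, 2019-03-12, 12 March 2019).
DATE_RE = re.compile(
    r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"
    r"|\b\d{4}-\d{2}-\d{2}\b"
    r"|\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return (text, {"entities": [(start, end, label), ...]}) sorted by offset."""
    entities = [(m.start(), m.end(), "DATE_ID") for m in DATE_ID_RE.finditer(text)]
    entities += [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    return text, {"entities": sorted(entities)}


if __name__ == "__main__":
    print(annotate("Invoice date: 12/03/2019"))
```

Annotations produced this way would still need manual review before being used to train a statistical model, since the patterns will miss formats they were not written for.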
After identifying these two entities, I run dependency parsing to identify the correct date. Another approach could be based on the relative spatial distance between the two entities.
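The spatial-distance idea could be sketched roughly as follows. This assumes each recognized entity comes with page coordinates (for example from an OCR layer), which the question does not specify; the Span class, the pair_by_distance function, and the max_dist threshold are all hypothetical. Each date_id is paired with the nearest date that sits to its right or below it, which covers both the "label: value" and the tabular layouts.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    label: str  # "DATE_ID" or "DATE"
    x: float    # left edge of the bounding box
    y: float    # top edge; larger y means lower on the page


def pair_by_distance(spans, max_dist=200.0):
    """Pair each DATE_ID with the closest DATE located to its right or below it."""
    keys = [s for s in spans if s.label == "DATE_ID"]
    values = [s for s in spans if s.label == "DATE"]
    pairs = []
    for key in keys:
        # Assumption: a value never sits above or to the left of its key.
        candidates = [v for v in values if v.x >= key.x and v.y >= key.y]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda v: math.hypot(v.x - key.x, v.y - key.y))
        if math.hypot(nearest.x - key.x, nearest.y - key.y) <= max_dist:
            pairs.append((key.text, nearest.text))
    return pairs


if __name__ == "__main__":
    spans = [
        Span("Invoice date", "DATE_ID", 50, 100),
        Span("12/03/2019", "DATE", 150, 100),
        Span("Due date", "DATE_ID", 50, 130),
        Span("11/04/2019", "DATE", 150, 130),
    ]
    print(pair_by_distance(spans))
```

Real OCR coordinates are noisy, so the strict "right or below" test would probably need a small tolerance on y for text that shares a line.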
Pardon me if my approach is naive. Please let me know whether it makes sense, or if you know of another approach that could solve my problem.