Invoice Parsing using spaCy

I’m trying to extract semantic information from invoices. After a lot of searching, I found that spaCy could be very useful for my problem domain. I want to extract 6-8 fields that are commonly found in an invoice. For simplicity, I am just taking the date field as an example.
From the sample invoice templates, I have found these formats for the date field:


Here, we can see that dates are represented in various formats. Sometimes there are more than two date fields, and sometimes the date is placed further down in a table.

My approach is: I am training two NER entity types as a key-value pair (date_id, date) to identify the date on which the invoice was issued. date_id can take syntactic forms such as “Invoice date”, “Date:”, “Billing”, “This itinerary was generated on”, etc., and date can be in any kind of date format.

After identifying these two entities, I do dependency parsing to identify the correct date. Another approach could be based on the relative spatial distance between the two entities.

Pardon me if I am very naive in my approach and let me know if it makes sense or if you have any other approach which can solve my problem.

I think identifying the trigger phrase (like “invoice date”) as well as the date itself is a good idea. You might want to lean on the matcher rules heavily for this task as well, as I think that’ll perform quite well for you. The trick is to make sure you’ve got a good evaluation set, so you can verify accuracy as you develop your rules. This would also let you benchmark the statistical NER against your rule-based process, to check whether the NER is catching cases your rules don’t cover. If it is, you might want to adjust the rules.
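To make the benchmarking idea concrete, here is a minimal sketch in plain Python. It assumes you have already reduced each system's output to a set of date strings per invoice. The invoice ids, the dates, and the names `precision_recall`, `rules`, `ner` and `missed_by_rules` are all made up for illustration; nothing here is a spaCy API.

```python
def precision_recall(predicted, gold):
    """Score predictions against an evaluation set.

    Both arguments map an invoice id to the set of date strings
    extracted from that invoice. Only invoices present in `gold`
    are scored.
    """
    tp = fp = fn = 0
    for doc_id, gold_dates in gold.items():
        pred_dates = predicted.get(doc_id, set())
        tp += len(pred_dates & gold_dates)   # correct extractions
        fp += len(pred_dates - gold_dates)   # spurious extractions
        fn += len(gold_dates - pred_dates)   # missed dates
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical evaluation data: the invoice date per document.
gold = {"inv1": {"2019-03-12"}, "inv2": {"2019-04-01"}, "inv3": {"2019-05-20"}}
rules = {"inv1": {"2019-03-12"}, "inv2": set(), "inv3": {"2019-05-20"}}
ner = {"inv1": {"2019-03-12"}, "inv2": {"2019-04-01"}, "inv3": {"2019-06-01"}}

rule_scores = precision_recall(rules, gold)
ner_scores = precision_recall(ner, gold)

# Correct NER predictions that the rules did not produce:
# these are the candidates for new rules.
missed_by_rules = {}
for doc_id, gold_dates in gold.items():
    extra = (ner.get(doc_id, set()) & gold_dates) - rules.get(doc_id, set())
    if extra:
        missed_by_rules[doc_id] = extra
```

Looking through `missed_by_rules` by hand is how you would decide which of those cases are unambiguous enough to turn into a rule.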

Here’s one way to think about it: The statistical model can help you resolve cases which have difficult ambiguities, based on the context of the phrase. In this situation, you really need the statistical model, and rules wouldn’t work well. But there are other situations where the statistical model simply generalises a bit better than your initial rules, and discovers new unambiguous cases. If so, you can just go ahead and update your rules, to cover the new cases. This way you know you’ll be recognising that pattern reliably, while the statistical model might encounter a context where it gets the case wrong.

I would also probably avoid trying to use the dependency parser. I think its behaviour will be unreliable for your use-case. You’d be better off using the ordering of the elements to figure out the date given the trigger phrase.
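A minimal sketch of the ordering idea, using only the standard library: find the trigger phrase, then take the first date that follows it. The two regular expressions are simplified placeholders (they cover only a couple of trigger phrases and numeric dates), and `date_after_trigger` is a hypothetical helper name. In practice the trigger and date spans would come from your matcher rules or NER model instead.

```python
import re

# Simplified stand-ins for the trigger-phrase and date recognisers.
TRIGGER_RE = re.compile(r"invoice date|date\s*:", re.IGNORECASE)
DATE_RE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")


def date_after_trigger(text):
    """Return the first date that appears after the first trigger phrase."""
    trigger = TRIGGER_RE.search(text)
    if trigger is None:
        return None
    # Only look at the text that comes after the trigger phrase.
    date = DATE_RE.search(text, trigger.end())
    return date.group(0) if date else None


text = "Payment due 01/04/2019. Invoice date: 12/03/2019"
invoice_date = date_after_trigger(text)
```

Here the earlier due date is skipped because it precedes the trigger phrase, so `invoice_date` is `"12/03/2019"`.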

Thanks Honnibal for your reply. It was really helpful for proceeding further with my approach.
I have a few queries, sorry if I am being too naive:
From your reply I understood that I should compare two approaches to the problem: statistical NER vs. a rule-based process built on matcher rules.

Matcher rule: How do I write a matcher rule? Is it based on some explicit format, like a regular expression?

Statistical NER: Is this done by training the spaCy model on my custom NER labels? And can this resolve cases using the context of the phrase?

Correct me if I am wrong.

Also, I would like your suggestions on creating the training/testing/validation sets.
In my case, I have 6-8 custom NER labels per invoice.
1st approach: I train all the NER labels on my first invoice, then on the next invoice, and so on…
2nd approach: I put all the phrases for one NER label in a single file and train the model, then subsequently train for the next label. In this case I could also train one model per label. I am not sure whether this is advisable.

Really appreciate your valuable input on this.

Yes, it’s a JSON-format with a few quantifiers etc, sort of like a regular expression – but defined over tokens, rather than the raw string.
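As a rough illustration of what such a token-level pattern can look like, here is a minimal sketch. It assumes spaCy v3 (where `Matcher.add` takes a list of patterns) and a blank English pipeline, so no trained model is needed. The rule name `INVOICE_DATE` and the sample text are made up for the example, and the `SHAPE` value covers only one numeric date layout.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough: the pattern only uses lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token. "OP": "?" makes the punctuation token optional,
# much like "?" in a regular expression.
pattern = [
    {"LOWER": "invoice"},
    {"LOWER": "date"},
    {"IS_PUNCT": True, "OP": "?"},
    {"SHAPE": "dd/dd/dddd"},  # e.g. 12/03/2019
]
matcher.add("INVOICE_DATE", [pattern])

doc = nlp("Invoice date: 12/03/2019")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
```

Each match is a `(match_id, start, end)` triple of token indices, so the date itself is the last token of the matched span.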

Yes, that’s correct.

This sounds correct.

This sounds wrong. The NER model has to see the phrases in context. You can’t train the model on just a list of phrases — that’s not how it works.

Is there a way to use spatial information in spaCy?
The text of an invoice is usually laid out in columns, as in the images above, and OCR often merges several columns into one paragraph. I have the block positions from OCR, but it seems impossible to use this information in the current implementation.

It’s true that we don’t have a really satisfying way of incorporating arbitrary features into the models at the moment.

As a crude solution, you could include the position features as extra tokens into the text, either before or after. You could also try training different classifiers for the different positions, if you have enough training data.
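To sketch the "extra tokens" idea: one option is to bucket each OCR block's coordinates into a coarse grid and prepend the grid cell as a pseudo-token before the block text. Everything here is an assumption for illustration: the block format (`x`, `y`, `text`), the 3x3 grid, the `POS_R…_C…` token naming, and the helper names `position_token` and `add_position_tokens`.

```python
def position_token(x, y, page_width, page_height, cols=3, rows=3):
    """Map a block's top-left corner to a coarse grid cell token."""
    col = min(int(x / page_width * cols), cols - 1)
    row = min(int(y / page_height * rows), rows - 1)
    return f"POS_R{row}_C{col}"


def add_position_tokens(blocks, page_width, page_height):
    """Prefix each OCR block's text with its position token."""
    parts = []
    for block in blocks:
        token = position_token(block["x"], block["y"], page_width, page_height)
        parts.append(f"{token} {block['text']}")
    return "\n".join(parts)


# Hypothetical OCR output: coordinates in pixels on a 600x800 page.
blocks = [
    {"x": 40, "y": 30, "text": "ACME Ltd"},
    {"x": 450, "y": 35, "text": "Invoice date: 12/03/2019"},
]
tagged = add_position_tokens(blocks, page_width=600, page_height=800)
```

The tagged text is then what you would annotate and train on, so the model sees the position token as part of the context. A coarse grid keeps the number of distinct tokens small, which matters if training data is limited.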

If neither of those works, you can also create a custom text classification model, by subclassing the TextCategorizer class in spaCy and overriding the TextCategorizer.Model class method.

Hi @honnibal, is it possible to use a datetime parser module for the date parsing?