Invoice Parsing using Spacy

ashntu · August 5, 2018, 8:23am

I’m trying to extract semantic information from invoices. After a lot of searching, I found that spaCy could be very useful for my problem domain. I want to extract 6-8 fields that are commonly found in an invoice. For simplicity, I am taking the date field as an example.
From the sample invoice templates, I have found these formats for the date field:

Here we can see that dates are represented in various formats. Sometimes there are more than two date fields, and sometimes the date is placed further down in a tabular layout.

My approach is: I am training two NER entity types as a key-value pair (date_id, date) to identify the date on which the invoice was issued. date_id can take syntactic forms such as “Invoice date”, “Date:”, “Billing”, “This itinerary was generated on”, etc., and date covers all kinds of date formats.

After identifying these two entities, I use dependency parsing to identify the correct date. Another approach could be based on the relative spatial distance between the two entities.

Pardon me if I am very naive in my approach and let me know if it makes sense or if you have any other approach which can solve my problem.

honnibal · August 8, 2018, 1:29pm

I think identifying the trigger phrase (like “invoice date”) as well as the date itself is a good idea. You might want to lean on the matcher rules heavily for this task as well, as I think that’ll perform quite well for you. The trick is to make sure you’ve got a good evaluation set, so you can verify accuracy as you develop your rules. This would also let you benchmark the statistical NER against your rule-based process, to check whether the model is catching cases your rules don’t cover. If it is, you might want to adjust the rules.
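To make the benchmarking idea concrete, here is a minimal sketch of scoring rules and a model against the same evaluation set. Everything in it is an assumption for illustration: the `score` helper, the `(start, end, label)` span tuples and the toy numbers are made up, and exact-match scoring is only one of several reasonable choices.

```python
from typing import Dict, List, Set, Tuple

# A span is (start_char, end_char, label). All spans below are invented examples.
Span = Tuple[int, int, str]


def score(gold: List[Set[Span]], predicted: List[Set[Span]]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over a list of documents."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hand-labelled evaluation set: one set of spans per invoice.
gold = [
    {(0, 12, "DATE_ID"), (14, 27, "DATE")},
    {(5, 9, "DATE_ID"), (11, 21, "DATE")},
]
# Hypothetical output of the rule-based process and of the statistical model.
rules = [
    {(0, 12, "DATE_ID"), (14, 27, "DATE")},
    {(11, 21, "DATE")},
]
model = [
    {(0, 12, "DATE_ID"), (14, 27, "DATE")},
    {(5, 9, "DATE_ID"), (11, 21, "DATE"), (30, 34, "DATE")},
]

rule_scores = score(gold, rules)
model_scores = score(gold, model)

# Correct spans the model found that the rules missed: candidates for new rules.
missed_by_rules = [(g & m) - r for g, m, r in zip(gold, model, rules)]
```

Inspecting `missed_by_rules` is the part that feeds back into rule writing: each entry is a case the model got right and the rules did not.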

Here’s one way to think about it: The statistical model can help you resolve cases which have difficult ambiguities, based on the context of the phrase. In this situation, you really need the statistical model, and rules wouldn’t work well. But there are other situations where the statistical model simply generalises a bit better than your initial rules, and discovers new unambiguous cases. If so, you can just go ahead and update your rules, to cover the new cases. This way you know you’ll be recognising that pattern reliably, while the statistical model might encounter a context where it gets the case wrong.

I would also probably avoid trying to use the dependency parser. I think its behaviour will be unreliable for your use-case. You’d be better off using the ordering of the elements to figure out the date given the trigger phrase.
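As a rough sketch of using ordering instead of the parser, the function below picks the first date that follows a trigger phrase within a small token window. The function name, the `max_gap` threshold and the span tuples are all hypothetical; real invoices will probably need the window tuned or a different tie-breaking rule.

```python
from typing import List, Optional, Tuple

# (start_token, end_token, label), with end_token exclusive. Invented example data.
Span = Tuple[int, int, str]


def date_after_trigger(spans: List[Span], max_gap: int = 5) -> Optional[Span]:
    """Return the first DATE starting within max_gap tokens after a DATE_ID."""
    ordered = sorted(spans)  # sort by position in the document
    for i, (start, end, label) in enumerate(ordered):
        if label != "DATE_ID":
            continue
        # Only look at spans that come later in the document than the trigger.
        for candidate in ordered[i + 1:]:
            if candidate[2] == "DATE" and 0 <= candidate[0] - end <= max_gap:
                return candidate
    return None


spans = [
    (0, 1, "DATE"),       # a date before any trigger phrase: ignored
    (10, 12, "DATE_ID"),  # e.g. "Invoice date"
    (13, 16, "DATE"),     # the date right after the trigger
    (40, 43, "DATE"),     # some other date further down
]
invoice_date = date_after_trigger(spans)
```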

ashntu · August 10, 2018, 4:02am

Thanks, Honnibal, for your reply. It really helped me move forward with my approach.
I have a few queries, sorry if I am being too naive:
From your reply, I understood that I should compare two approaches to the problem: statistical NER vs. a rule-based process built on matcher rules.

Matcher rule: How do I write a matcher rule? Is it based on some explicit format, like a regular expression?

Statistical NER: Is this done by training a spaCy model on my custom entity types? And can this resolve cases using the context of the phrase?

Correct me if I am wrong.

I would also like your suggestions on creating the training/testing/validation sets.
In my case, I have 6-8 custom entity types per invoice.
1st approach: I train on all the entity types in my first invoice, then in the next invoice, and so on…
2nd approach: I put all the phrases for one entity type in a single file and train the model, then train for the next entity type. In this case I could also train one model per entity type. I am not sure whether this is advisable.

Really appreciate your valuable input on this.

honnibal · August 10, 2018, 1:50pm

Yes, it's a JSON-format with a few quantifiers etc, sort of like a regular expression -- but defined over tokens, rather than the raw string.
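For illustration, here is a small sketch of what such token patterns can look like. It assumes spaCy v3, where `Matcher.add` takes a list of patterns, and uses a blank English pipeline so no trained model is required. The rule names `DATE_ID` and `DATE` and the specific patterns are made up for this example and cover only one date format.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
matcher = Matcher(nlp.vocab)

MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]

# Each pattern is a list of dicts, and each dict describes one token.
matcher.add("DATE_ID", [
    [{"LOWER": "invoice"}, {"LOWER": "date"}],
    [{"LOWER": "billing"}, {"LOWER": "date"}],
])
matcher.add("DATE", [
    [
        {"IS_DIGIT": True},                # day, e.g. "5"
        {"LOWER": {"IN": MONTHS}},         # month name
        {"IS_PUNCT": True, "OP": "?"},     # optional comma: the "?" quantifier
        {"IS_DIGIT": True, "LENGTH": 4},   # four-digit year
    ],
])

doc = nlp("Invoice date: 5 August 2018")
matches = sorted(matcher(doc), key=lambda m: m[1])  # order by start token
found = [(nlp.vocab.strings[match_id], doc[start:end].text)
         for match_id, start, end in matches]
```

Because the patterns are defined over tokens, it is worth printing `[t.text for t in doc]` on a few real invoices first to see how the tokenizer splits your dates.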

Yes, that's correct.

This sounds correct.

This sounds wrong. The NER model has to see the phrases in context. You can't train the model on just a list of phrases --- that's not how it works.
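To show what "in context" means in practice, here is a sketch that builds one annotated example as the full invoice text plus character offsets, in the `(text, {"entities": [(start, end, label)]})` style used by spaCy's simple training format. The `annotate` helper, the labels and the sample text are hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def annotate(text: str, phrases: List[Tuple[str, str]]) -> Example:
    """Build (text, {"entities": [(start, end, label), ...]}) from phrase/label pairs.

    Uses the first occurrence of each phrase, which is a simplification.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


# The whole passage is the training text, not just the labelled phrases.
text = "ACME Ltd. Invoice date: 5 August 2018. Due date: 4 September 2018."
example = annotate(text, [("Invoice date", "DATE_ID"), ("5 August 2018", "DATE")])
```

Note that "4 September 2018" is deliberately left unlabelled here: it is a due date, and the only way the model can learn that distinction is by seeing the words around it.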

veonua · January 22, 2019, 2:27pm

Is there a way to use spatial information in spaCy?
The text of an invoice is usually laid out in columns, as in the images above, and OCR often merges several columns into one paragraph. I have the block positions from OCR, but it seems impossible to use this information in the current implementation.

honnibal · January 25, 2019, 9:42pm

It’s true that we don’t have a really satisfying way of incorporating arbitrary features into the models at the moment.

As a crude solution, you could include the position features as extra tokens into the text, either before or after. You could also try training different classifiers for the different positions, if you have enough training data.
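A crude sketch of the extra-tokens idea could look like the following. It assumes each OCR block comes with an `x` and `y` position already scaled to the range 0 to 1; the `COL`/`ROW` token names, the grid size and the helper function are invented for illustration.

```python
from typing import List, Tuple

# (text, x, y) with x and y scaled to 0..1. Hypothetical OCR output.
Block = Tuple[str, float, float]


def with_position_tokens(blocks: List[Block], n_cols: int = 3, n_rows: int = 4) -> str:
    """Prefix each block with coarse column/row tokens, in reading order."""
    parts = []
    # Read top to bottom, then left to right.
    for text, x, y in sorted(blocks, key=lambda b: (b[2], b[1])):
        col = min(int(x * n_cols), n_cols - 1)
        row = min(int(y * n_rows), n_rows - 1)
        parts.append(f"COL{col} ROW{row} {text}")
    return " ".join(parts)


blocks = [
    ("Total: 120.00", 0.70, 0.90),
    ("Invoice date: 5 August 2018", 0.65, 0.05),
    ("ACME Ltd.", 0.05, 0.05),
]
marked = with_position_tokens(blocks)
```

The resulting string is what you would annotate and train on, so the same marking has to be applied at prediction time as well.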

If neither of those works, you can also create a custom text classification model by subclassing the TextCategorizer class in spaCy and overriding the TextCategorizer.Model class method.

sarthaksinha31 · October 24, 2019, 12:15pm

Hi @honnibal, is it possible to use a datetime parser module for the date parsing?

angelquisit · May 12, 2020, 9:05am

Hi Ashntu,

I’m trying to tackle the exact same issue that you are facing. Could you advise how you went on to solve it?

I have a few questions I’d like to ask:

How did you train the NER for the key-value pair?
Did you end up using just the statistical model, or did you integrate the rule-based approach as well?

jravur · February 3, 2021, 12:35am

Hey guys, I'm trying to do the same. Can anyone help me with how to get started? It would be really helpful. Thanks

ankitladva11 · June 21, 2023, 1:39pm

Hey guys, can anybody help me with this task?

ryanwesslen · June 21, 2023, 3:22pm

hi @ankitladva11!

If your invoices are PDFs, my colleague @ljvmiranda921 wrote an excellent and detailed blog post on creating a document processing workflow with Prodigy and fine-tuning HuggingFace's LayoutLMv3:

Also, a lot of Matt's suggestions require a good knowledge of spaCy. If your questions are spaCy-specific, you'll be better off posting questions on the spaCy GitHub discussions forum. That's where the spaCy core team answers questions and they have answered several related posts on invoice parsing:

Hope this helps!

ankitladva11 · June 22, 2023, 10:07am

Thank you for sharing it. I’m dealing with images of retail supermarket bills. If you have specific resources for that, please share them. Thank you again.
