NER for Invoices - Solving for scale in Enterprise application - Possible partnership

Hi Matthew and Ines,

I am Soubhagya Mohapatra, a Product Manager for AI and Analytics offerings at Accenture. In our group, we solve and scale many real-world business-process problems in the F&A, Healthcare, FS, HR, Procurement and Supply Chain domains.

In an effort to extract information from a wide variety of invoices (both line and header details), we deployed a CRF model, which did pretty well. However, we realized that the model did not scale to newer templates or to invoices from a different industry. Another challenge we faced along the way was gathering and annotating a reasonable number of examples for every group of invoice templates. We typically convert an image to a digitized PDF, then to HTML using PDF2HTMLX, annotate it using WebAnnotator (a Firefox plugin that works with older versions) and then use the result in modeling. Apart from the standard features of word shape and location, our modeling relies heavily on HTML DIV information.

Recently, I came across Prodigy, and I wonder whether it can solve our second problem, the annotation. We were thinking of building a layer internally between our current data model and the output of Prodigy. That way, users can start from a blank slate with no annotated data, and whenever they rubber-band some content on the invoice or select/copy a field, we can tag those locations and DIVs and pass them on to our model. This way we can build on the gathered data and pass it to our CRF model at a specified frequency to possibly improve the model's accuracy.
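To make the idea of such a layer concrete, here is a minimal sketch of one thing it might do: turn character-offset span annotations into per-token BIO tags, the kind of sequence labels a CRF is usually trained on. The record shape (a `text` plus `spans` with `start`, `end` and `label`) is an assumption about what the annotation tool would export, the function name `spans_to_bio` and the whitespace tokenizer are hypothetical, and the HTML DIV features are left out entirely.

```python
import re

def spans_to_bio(text, spans):
    """Convert character-offset spans into one BIO tag per whitespace token."""
    # Hypothetical tokenizer: split on whitespace and remember character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        inside = False
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the span only if it lies fully within it.
            if start >= span["start"] and end <= span["end"]:
                tags[i] = ("I-" if inside else "B-") + span["label"]
                inside = True
    return [(token, tag) for (token, _, _), tag in zip(tokens, tags)]

# Assumed export format: one record per text, with labelled character spans.
record = {
    "text": "Inv. No: 12345 Total: 99.50 USD",
    "spans": [
        {"start": 9, "end": 14, "label": "INVOICE_NUMBER"},
        {"start": 22, "end": 27, "label": "TOTAL_AMOUNT"},
    ],
}

for token, tag in spans_to_bio(record["text"], record["spans"]):
    print(token, tag)
```

A real version would also need to carry the DIV and location information alongside each token, since our model depends on it.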

I would like to understand whether Prodigy can support this and whether the technical integration can be achieved reasonably quickly. There is huge value to be realized if we can achieve this integration. We are happy to discuss this separately with you over email or a conference call… My email is and will be happy to take this forward if you think this is feasible.

A few more points about our annotation use case: it is not a Yes/No type of task. Rather, every field will be present in every document, and the user needs to find and annotate it. We have 8-10 header fields (Supplier Name/Address, Bill To Name/Address, Invoice Date, Invoice Number, Total Amount, Tax Amount, Currency, etc.) and 5 sets of line fields (Description, Amount, Units, Unit Price, etc.). Secondly, our current model uses both mentions and values, which means the annotation includes both. For example, the annotator will annotate “Inv. No:” as the mention of “Invoice Number” and “12345” as the value of “Invoice Number”.
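As a rough illustration of the mention/value scheme, the sketch below derives a `_MENTION` and a `_VALUE` label for each field and then groups annotated spans back by field. The label naming convention, the span format (`start`, `end`, `label`) and both helper names are assumptions made for this example, not something our current pipeline or Prodigy prescribes.

```python
def build_labels(fields):
    """Return a MENTION and a VALUE label for every field name."""
    return [f"{field}_{kind}" for field in fields for kind in ("MENTION", "VALUE")]

def pair_mentions_and_values(text, spans):
    """Group annotated spans by field: {field: {"mention": ..., "value": ...}}."""
    paired = {}
    for span in spans:
        # Split "INVOICE_NUMBER_VALUE" into the field name and the kind.
        field, _, kind = span["label"].rpartition("_")
        paired.setdefault(field, {})[kind.lower()] = text[span["start"]:span["end"]]
    return paired

labels = build_labels(["INVOICE_NUMBER", "TOTAL_AMOUNT"])

# Hypothetical annotations for the "Inv. No:" example above.
text = "Inv. No: 12345"
spans = [
    {"start": 0, "end": 8, "label": "INVOICE_NUMBER_MENTION"},
    {"start": 9, "end": 14, "label": "INVOICE_NUMBER_VALUE"},
]
pairs = pair_mentions_and_values(text, spans)
print(labels)
print(pairs)
```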

Looking forward to hearing from you!


Hi Sou,

We actually have a number of people within Accenture already using Prodigy successfully on different projects, but I understand it's a big company, so you probably aren't working with any of the people involved.

I do think your project should be feasible with Prodigy, but unfortunately we don't currently have any bandwidth for customizations, partnerships or consulting.

One thing I would advise though:

I think you should avoid having annotation tasks that involve so many separate annotations. Instead of making one pass over the text with 8-10 different fields, you should restructure it so that you make 8-10 passes over the text, with one annotation at a time. This will be both faster and more accurate.
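As a rough sketch of what that restructuring could look like in plain Python (this is not a Prodigy recipe, and the task fields and function names here are hypothetical): build one batch of tasks per field, annotate each batch separately, then merge the spans from all passes back together per text.

```python
from collections import defaultdict

def one_field_per_pass(texts, fields):
    """Yield (field, batch) pairs: one full pass over all texts per field."""
    for field in fields:
        yield field, [{"text": text, "field": field} for text in texts]

def merge_passes(annotated_tasks):
    """Combine the spans collected in separate passes, keyed by text."""
    merged = defaultdict(list)
    for task in annotated_tasks:
        merged[task["text"]].extend(task.get("spans", []))
    # Sort spans by position so the merged result reads left to right.
    return {text: sorted(spans, key=lambda s: s["start"]) for text, spans in merged.items()}

texts = ["Inv. No: 12345 Total: 99.50 USD"]
passes = list(one_field_per_pass(texts, ["INVOICE_NUMBER", "TOTAL_AMOUNT"]))

# Hypothetical results, as if each pass had been annotated on its own.
annotated = [
    {"text": texts[0], "spans": [{"start": 22, "end": 27, "label": "TOTAL_AMOUNT"}]},
    {"text": texts[0], "spans": [{"start": 9, "end": 14, "label": "INVOICE_NUMBER"}]},
]
merged = merge_passes(annotated)
print(merged)
```

Each pass then asks the annotator to look for only one thing, and the merge step recovers the full set of fields per document afterwards.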
