Model and label strategy for information extraction task

I posted earlier about information extraction on exchange statements. So far it has worked well using the spaCy matchers, but now I want to add some more ML since the current model has its limitations. I want to be thorough in my explanation, so sorry for the long post.

:newspaper: Input

My input data could be something like (this one has new headlines and bullet points which is not always the case!)

Second quarter - 1 August to 31 October 2019

  • Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
  • Online sales increased 21% to 115 MSEK (95)
  • Operating profit, including non-recurring costs and costs linked to the CO100+ action programme, totalled 133 MSEK (33). Excluding the effect of IFRS 16, operating profit amounted to 107 MSEK
  • Underlying operating profit amounted to 153 MSEK (124) (excl IFRS 16)
  • The operating margin was 6.1% (1.6)
  • Net debt/EBITDA excluding the effect of IFRS 16 (12 months) amounted to 0.6 times (0.5)
  • Profit after tax totalled 91 MSEK (25)
  • Earnings per share amounted to 1.43 SEK (0.40)
  • A partnership with Kolonial.no, Norway’s largest online food retailer, started in September

Six months - 1 May to 31 October 2019

  • Sales in the Nordics increased by 4% to 4,157 MSEK (3,978), organic growth up 4% and total sales increased 2% to 4,209 MSEK (4,115), organic growth up 2%
  • Operating profit, including non-recurring items and costs linked to CO100+ action programme, totalled 212 MSEK (65). Excluding the effect of IFRS 16, operating profit amounted to 159 MSEK
  • Underlying operating profit amounted to 244 MSEK (186) (excl IFRS 16)
  • The operating margin was 5.0% (1.6)
  • Profit after tax totalled 139 MSEK (55)
  • Earnings per share amounted to 2.21 SEK (0.87)

:package: Output

From this I want to extract several metrics, e.g. an EARNINGS payload:

{'period': 'Q2', 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': 'CURRENT'}

The information is found in this line

Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%

plus the context of that sentence. Notice that Sales in the Nordics is not of interest.

:rocket: Proposed model(s)

My idea is to have two types of models

  1. Local text models that run on smaller paragraphs.
  2. Context classifiers that determine the context.

:mag_right: Local text models

The input for the local text models would look like the following

Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%

which should produce an output like

{'period': None, 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': None}

This works pretty well using the matchers, especially for determining the currency, multiplier and metric. But to determine the correct amount, I imagine training a NER model that locates CURRENT_AMOUNT ("current" because it should ignore any previous years' numbers that are mentioned), in this case 2,165.
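A minimal sketch of the rule-based half of this step, assuming that current-period figures are always followed by a unit (MSEK or SEK) while previous-year comparatives appear as bare numbers in parentheses, as in the sample above. The function name and the regular expression are illustrative and are not part of spaCy or Prodigy. Note that it only produces candidates: it returns both 2,145 and 2,165 for the sales sentence, and choosing between them is the part the NER model would have to learn.

```python
import re

# A number followed by whitespace and a unit, e.g. "2,165 MSEK" or "1.43 SEK".
# The lookbehind stops a match from starting inside a word, inside a longer
# number, or directly after "(". Figures such as "(2,157)" have no unit after
# them, so they are never matched.
AMOUNT_RE = re.compile(r"(?<![\w,.(])(\d[\d,]*(?:\.\d+)?)\s+(MSEK|SEK)\b")


def find_amount_candidates(text):
    """Return (start, end, value, unit) for each number that carries a unit."""
    candidates = []
    for match in AMOUNT_RE.finditer(text):
        # Strip thousands separators before converting, "2,165" -> 2165.0.
        value = float(match.group(1).replace(",", ""))
        candidates.append((match.start(1), match.end(1), value, match.group(2)))
    return candidates
```

The character offsets returned for each candidate could be used as span boundaries when pre-annotating text for the NER labelling.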

:artificial_satellite: Context classifiers

The purpose of the context classifiers is to fill in the blanks for each CURRENT_AMOUNT token/span, i.e. determine

year: [PREVIOUS, CURRENT, NEXT]
period: [Q1, Q2, Q3, Q4, HALF-YEAR, NINE-MONTHS, FULL-YEAR, None]
type_of_content: [EARNINGS, OUTLOOK, OTHER] 

For this I imagine a feature space where the context up to a sentence is transformed into a vector (weighting possible headlines more than the rest) and the sentence itself is transformed into a second vector. I would then run three classifiers on the combined features.
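One way to picture that feature space is a weighted bag of words. The sketch below assumes the headlines have already been identified by some other means, and the weight of 3.0 is an arbitrary placeholder. All function and parameter names are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased alphanumeric tokens with their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def context_features(headlines, preceding_lines, sentence, headline_weight=3.0):
    """Build one sparse feature dict from the context and the sentence.

    Context tokens get a "ctx=" prefix and sentence tokens a "sent=" prefix,
    so a classifier can tell the two sources apart. Headline tokens count
    headline_weight times as much as tokens from ordinary preceding lines.
    """
    features = Counter()
    for line in headlines:
        for token, count in bag_of_words(line).items():
            features["ctx=" + token] += headline_weight * count
    for line in preceding_lines:
        for token, count in bag_of_words(line).items():
            features["ctx=" + token] += count
    for token, count in bag_of_words(sentence).items():
        features["sent=" + token] += count
    return dict(features)
```

The same dictionary could be fed to each of the three classifiers (year, period, type_of_content), for instance after turning it into a sparse matrix with something like scikit-learn's DictVectorizer.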

For the period I could add MONTH_SPAN as well and then transform that into e.g. Q1, which is fine.
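The MONTH_SPAN conversion depends on the company's fiscal year: in the sample above, 1 August to 31 October is the second quarter, which implies a fiscal year starting in May. A minimal sketch, assuming the start and end months have already been parsed to integers and the fiscal-year start is known (the function and its parameters are hypothetical):

```python
def period_from_month_span(start_month, end_month, fiscal_year_start=1):
    """Map an inclusive month span (1-12) to a period label.

    Returns None when the span does not line up with a quarter,
    half-year, nine-month or full-year period of the fiscal year.
    """
    # How many months after the start of the fiscal year the span begins (0..11).
    offset = (start_month - fiscal_year_start) % 12
    # Number of months covered, inclusive; the modulo handles spans that
    # wrap around the calendar year, e.g. November to January.
    length = (end_month - start_month) % 12 + 1

    if length == 3 and offset % 3 == 0:
        return "Q%d" % (offset // 3 + 1)
    if offset == 0:
        return {6: "HALF-YEAR", 9: "NINE-MONTHS", 12: "FULL-YEAR"}.get(length)
    return None
```

With `fiscal_year_start=5`, the span August to October maps to Q2 and May to October maps to HALF-YEAR, matching the two headlines in the sample.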

:paintbrush: Labelling

Now finally to my question. Can I use prodigy effectively for this kind of task?

As I see it, I have four labelling tasks:

  • ner labelling to find CURRENT_AMOUNT in paragraphs.
  • textcat (or just cat?) labelling of period, year and type_of_content for each CURRENT_AMOUNT, i.e. three classification tasks.

My first challenge, though, is that the majority of the documents do not contain a CURRENT_AMOUNT, so it's tough starting from scratch. Instead, I plan to label a large batch using patterns, which would have about 50% precision, pretrain a model on those annotations, and then run ner.teach with that model to start collecting data. Would that work?
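A sketch of what such a patterns file could contain, written as spaCy-style token patterns in the JSONL form that Prodigy recipes accept through their patterns argument. The unit list is an assumption taken from the sample statement, and a pattern like this highlights the number together with its unit, so whether the unit belongs inside the CURRENT_AMOUNT span is a decision still to be made. It also matches both the Nordics figure and the total figure, which is consistent with the low precision expected from patterns alone.

```python
import json

# Hypothetical unit tokens, taken from the sample statement; extend as needed.
UNITS = ["msek", "sek"]


def build_patterns(label="CURRENT_AMOUNT", units=UNITS):
    """One token pattern per unit: a number-like token followed by the unit."""
    return [
        {"label": label, "pattern": [{"LIKE_NUM": True}, {"LOWER": unit}]}
        for unit in units
    ]


def to_jsonl(patterns):
    """Serialise the patterns as one JSON object per line."""
    return "\n".join(json.dumps(pattern) for pattern in patterns)
```

Writing the result of `to_jsonl(build_patterns())` to a file gives something that can be passed to a recipe such as ner.manual or ner.teach for pre-highlighting candidates.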

For the classifier labelling, I simply plan to classify each CURRENT_AMOUNT, which should be straightforward.
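For illustration, one classification task per CURRENT_AMOUNT could be represented as a dictionary with the sentence text, the highlighted span and the candidate labels as options. The layout below follows the text/spans/options shape used by Prodigy's multiple-choice interface as far as I understand it, so treat the exact keys as an assumption to check against the Prodigy documentation. The helper name is hypothetical.

```python
PERIODS = ["Q1", "Q2", "Q3", "Q4", "HALF-YEAR", "NINE-MONTHS", "FULL-YEAR", "None"]


def make_choice_task(text, start, end, options, span_label="CURRENT_AMOUNT"):
    """Build one annotation task: a highlighted amount plus the labels to choose from."""
    return {
        "text": text,
        # Character offsets of the amount the annotator is asked about.
        "spans": [{"start": start, "end": end, "label": span_label}],
        # One selectable option per candidate class.
        "options": [{"id": option, "text": option} for option in options],
    }
```

The same helper could be reused for year and type_of_content by passing a different list of options.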

Again, I'm sorry for the rather long story, but I hope it at least makes things clear. I look forward to any comments, as well as answers to my questions about the NER labelling strategy.

I think the approach you're proposing sounds reasonable, but in the end every information extraction project is a bit different, because the trick is to exploit the regularities in how the information is expressed in the data. A lot of the academic work comes at the task in a "fully general solution" sort of way, which requires a lot of annotation and effort. On any given problem, though, the information is usually presented in a pretty constrained manner, which makes a combination of ML and rule-based approaches quite effective. The specifics are a bit different every time, so it's hard to make our advice too concrete, unfortunately.