Model and label strategy for information extraction task

nix411 · December 11, 2019, 1:41pm

I previously posted about information extraction on exchange statements. So far the spaCy matchers have worked well, but the current approach has its limitations, so I now want to add more ML to it. I want to be thorough in my explanation, so apologies for the long post.

Input

My input data could look something like this (this example has headlines and bullet points, which is not always the case!):

Second quarter - 1 August to 31 October 2019

Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%

Online sales increased 21% to 115 MSEK (95)

Operating profit, including non-recurring costs and costs linked to the CO100+ action programme, totalled 133 MSEK (33). Excluding the effect of IFRS 16, operating profit amounted to 107 MSEK

Underlying operating profit amounted to 153 MSEK (124) (excl IFRS 16)

The operating margin was 6.1% (1.6)

Net debt/EBITDA excluding the effect of IFRS 16 (12 months) amounted to 0.6 times (0.5)

Profit after tax totalled 91 MSEK (25)

Earnings per share amounted to 1.43 SEK (0.40)

A partnership with Kolonial.no, Norway’s largest online food retailer, started in September

Six months - 1 May to 31 October 2019

Sales in the Nordics increased by 4% to 4,157 MSEK (3,978), organic growth up 4% and total sales increased 2% to 4,209 MSEK (4,115), organic growth up 2%

Operating profit, including non-recurring items and costs linked to CO100+ action programme, totalled 212 MSEK (65). Excluding the effect of IFRS 16, operating profit amounted to 159 MSEK

Underlying operating profit amounted to 244 MSEK (186) (excl IFRS 16)

The operating margin was 5.0% (1.6)

Profit after tax totalled 139 MSEK (55)

Earnings per share amounted to 2.21 SEK (0.87)

Output

From this I want to extract several metrics, e.g. an EARNINGS payload:

{'period': 'Q2', 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': 'CURRENT'}

The information is found in this line

Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%

plus the context of that sentence. Notice that Sales in the Nordics is not of interest.

Proposed model(s)

My idea is to have two types of models

Local text models that run on smaller paragraphs.
Context classifiers that determine the context.

Local text models

The input for the local text models would look like the following

Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%

which should produce an output like

{'period': None, 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': None}

This works pretty well using the matchers, especially for determining the currency, multiplier and metric. But to determine the correct amount, I imagine training a NER model that locates CURRENT_AMOUNT ("current" because it should ignore any previous years' numbers that are mentioned), in this case 2,165.
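To make the CURRENT_AMOUNT idea concrete, here is a minimal plain-regex sketch (not the spaCy matcher setup described above). It separates figures that are followed by a unit from the bracketed figures that follow them. The regex, the name `find_amounts` and the assumption that a bracketed number right after an amount is the prior-year value are all illustrative, not part of the existing pipeline.

```python
import re

# Assumption: "<number> <unit>" is a current-period figure, and a bare number
# in parentheses immediately after it is the comparison (prior-year) figure.
AMOUNT_RE = re.compile(
    r"(?P<current>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>MSEK|SEK)"
    r"(?:\s*\((?P<previous>\d[\d,]*(?:\.\d+)?)\))?"
)


def find_amounts(text):
    """Return one dict per '<number> <unit> (<number>)' occurrence in text."""
    results = []
    for match in AMOUNT_RE.finditer(text):
        results.append({
            "current": match.group("current"),
            "previous": match.group("previous"),  # None if no bracketed figure
            "unit": match.group("unit"),
            "start": match.start("current"),      # character offsets of the
            "end": match.end("current"),          # current-period number
        })
    return results


sentence = (
    "Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% "
    "and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%"
)
amounts = find_amounts(sentence)
```

On the example sentence this yields two candidates, 2,145 and 2,165. A rule like this cannot tell that only the total sales figure is wanted, which is the gap the NER model would have to fill.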

Context classifiers

The purpose of the context classifiers is to fill in the blanks for each CURRENT_AMOUNT token/span, i.e. determine

year: [PREVIOUS, CURRENT, NEXT]
period: [Q1, Q2, Q3, Q4, HALF-YEAR, NINE-MONTHS, FULL-YEAR, None]
type_of_content: [EARNINGS, OUTLOOK, OTHER]

For this I imagine a feature space in which the context up to a given sentence is transformed into a vector (weighting likely headlines more heavily than the rest) and the sentence itself is transformed into a second vector. Three classifiers would then run on those features.
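A rough sketch of that feature construction, under loud assumptions: `embed` is a toy hashed bag-of-words stand-in for whatever real vectors would be used, headlines are assumed to be flagged upstream, and `headline_weight=3.0` is an arbitrary illustrative value.

```python
import hashlib

DIM = 16  # toy dimensionality, chosen only for illustration


def embed(text, dim=DIM):
    """Toy hashed bag-of-words vector; a placeholder for real embeddings."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec


def context_vector(lines, headline_weight=3.0, dim=DIM):
    """Weighted mean of line vectors; lines flagged as headlines count more.

    `lines` is a list of (text, is_headline) pairs preceding the sentence.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for text, is_headline in lines:
        weight = headline_weight if is_headline else 1.0
        for i, value in enumerate(embed(text, dim)):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


def features(context_lines, sentence):
    """Concatenate the context vector and the sentence vector."""
    return context_vector(context_lines) + embed(sentence)


context = [
    ("Second quarter - 1 August to 31 October 2019", True),
    ("Online sales increased 21% to 115 MSEK (95)", False),
]
x = features(context, "Profit after tax totalled 91 MSEK (25)")
```

The same `x` could then be fed to three separate classifiers, one each for year, period and type_of_content.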

For the period I could add MONTH_SPAN as well and then transform that into e.g. Q1, which is fine.
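The MONTH_SPAN-to-period step could look roughly like the sketch below. It assumes the fiscal year's start month is known; in the sample above it appears to be May, since the six-month period runs from 1 May to 31 October. The function name and signature are hypothetical.

```python
def month_span_to_period(start_month, end_month, fiscal_start_month):
    """Map a month span (1-12, inclusive) to a period label, or None.

    Assumes the span lies within one fiscal year starting in fiscal_start_month.
    """
    length = (end_month - start_month) % 12 + 1           # months covered
    offset = (start_month - fiscal_start_month) % 12      # months into fiscal year
    if length == 3 and offset % 3 == 0:
        return "Q{}".format(offset // 3 + 1)
    if offset == 0:
        # Cumulative periods must start at the beginning of the fiscal year.
        return {6: "HALF-YEAR", 9: "NINE-MONTHS", 12: "FULL-YEAR"}.get(length)
    return None


second_quarter = month_span_to_period(8, 10, fiscal_start_month=5)  # 1 August to 31 October
six_months = month_span_to_period(5, 10, fiscal_start_month=5)      # 1 May to 31 October
```

Spans that do not line up with a quarter or a cumulative period fall through to None, which matches the None option in the period label set.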

Labelling

Now, finally, to my question: can I use Prodigy effectively for this kind of task?

As I see it, I have four labelling tasks:

ner labelling for finding CURRENT_AMOUNT given paragraphs.
textcat (or just cat?) for period, year and type_of_content for each CURRENT_AMOUNT

My first challenge, though, is that the majority of the documents do not contain a CURRENT_AMOUNT, so it's tough starting from scratch. Instead I plan to label a large batch using patterns, which would have around 50% precision, pretrain a model on that, and then run ner.teach with that model to start collecting data. Would that work?
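As a sketch of the pattern-based bootstrapping step, the snippet below turns raw paragraphs into pre-annotated JSONL records with `text` and `spans` (`start`, `end`, `label`) keys, which is the general shape Prodigy uses for span annotations. The rule itself (any number directly followed by "MSEK") is a hypothetical low-precision pattern, and the character offsets would still need to line up with token boundaries.

```python
import json
import re

# Hypothetical low-precision rule: any number directly followed by "MSEK".
CANDIDATE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?(?=\s*MSEK)")


def to_task(text, label="CURRENT_AMOUNT"):
    """Build one record with character-offset spans for every rule match."""
    spans = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in CANDIDATE_RE.finditer(text)
    ]
    return {"text": text, "spans": spans}


paragraphs = [
    "Online sales increased 21% to 115 MSEK (95)",
    "The operating margin was 6.1% (1.6)",  # no amount: stays as a negative example
]
tasks = [to_task(p) for p in paragraphs]
jsonl = "\n".join(json.dumps(task) for task in tasks)
```

Paragraphs without a match are kept with an empty `spans` list, so the many documents with no CURRENT_AMOUNT still contribute as negative examples.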

For the classifier labelling I simply plan to classify each CURRENT_AMOUNT, which should be just fine.

Again, I'm sorry for the rather long story, but I hope it at least makes things clear. I'm looking forward to hearing any comments, as well as answers to my questions regarding the NER labelling strategy.

honnibal · December 12, 2019, 11:14am

I think the approach you're proposing sounds reasonable, but in the end every information extraction project is a bit different, because the trick is to exploit the regularities in how the information is expressed in the data. A lot of the academic work comes at the task in a "fully general solution" sort of way, which requires a lot of annotation and effort. But on any given problem, the information is usually presented in a pretty constrained manner, which makes a combination of ML and rule-based approaches quite effective. The specifics of that are a bit different every time, so it's hard to make our advice too concrete, unfortunately.
