I posted earlier about information extraction from exchange statements. So far the spaCy matchers have worked well, but the current approach has its limitations, so I now want to add more ML to it. I want to be thorough in my explanation, so apologies for the long post.
Input
My input data could look something like this (this example has headlines and bullet points, which is not always the case!):
Second quarter - 1 August to 31 October 2019
- Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
- Online sales increased 21% to 115 MSEK (95)
- Operating profit, including non-recurring costs and costs linked to the CO100+ action programme, totalled 133 MSEK (33). Excluding the effect of IFRS 16, operating profit amounted to 107 MSEK
- Underlying operating profit amounted to 153 MSEK (124) (excl IFRS 16)
- The operating margin was 6.1% (1.6)
- Net debt/EBITDA excluding the effect of IFRS 16 (12 months) amounted to 0.6 times (0.5)
- Profit after tax totalled 91 MSEK (25)
- Earnings per share amounted to 1.43 SEK (0.40)
- A partnership with Kolonial.no, Norway’s largest online food retailer, started in September
Six months - 1 May to 31 October 2019
- Sales in the Nordics increased by 4% to 4,157 MSEK (3,978), organic growth up 4% and total sales increased 2% to 4,209 MSEK (4,115), organic growth up 2%
- Operating profit, including non-recurring items and costs linked to CO100+ action programme, totalled 212 MSEK (65). Excluding the effect of IFRS 16, operating profit amounted to 159 MSEK
- Underlying operating profit amounted to 244 MSEK (186) (excl IFRS 16)
- The operating margin was 5.0% (1.6)
- Profit after tax totalled 139 MSEK (55)
- Earnings per share amounted to 2.21 SEK (0.87)
Output
From this I want to extract several metrics, e.g. an EARNINGS payload:
{'period': 'Q2', 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': 'CURRENT'}
The information is found in this line
Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
plus the context of that sentence. Notice that Sales in the Nordics is not of interest.
Proposed model(s)
My idea is to have two types of models
- Local text models that run on smaller paragraphs.
- Context classifiers that determine the context.
Local text models
The input for the local text models would look like the following
Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
which should produce an output like
{'period': None, 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': None}
This works pretty well using the matchers, especially for determining the currency, multiplier and metric. To determine the correct amount, though, I imagine training a NER model that locates CURRENT_AMOUNT (current, since it should ignore numbers from previous years if they are mentioned), in this case 2,165.
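To illustrate why rules alone fall short here, this is a minimal sketch of a rule-based candidate finder. It uses a plain regular expression rather than the spaCy matchers, and the function name and the convention "a number directly followed by a currency unit is a current-year figure, a bare number in parentheses is last year's" are assumptions taken from the example text, not a general rule.

```python
import re

# Assumption: current-year figures are followed by a unit (MSEK/SEK),
# while prior-year figures appear bare inside parentheses, e.g. "(2,089)".
AMOUNT_RE = re.compile(r"(?<![\w(.,])(\d[\d,]*(?:\.\d+)?)\s*(MSEK|SEK)\b")


def current_amount_candidates(sentence):
    """Return (text, start, end, unit) for every number followed by a unit."""
    return [
        (m.group(1), m.start(1), m.end(1), m.group(2))
        for m in AMOUNT_RE.finditer(sentence)
    ]


sentence = (
    "Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic "
    "growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), "
    "organic growth up 1%"
)
candidates = current_amount_candidates(sentence)
print([c[0] for c in candidates])  # ['2,145', '2,165']
```

The rule skips the parenthesised prior-year numbers but still returns both 2,145 (Nordics) and 2,165 (total), so choosing the amount of interest is the part left to the NER model.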
Context classifiers
The purpose of the context classifiers is to fill in the blanks for each CURRENT_AMOUNT token/span, i.e. determine
year: [PREVIOUS, CURRENT, NEXT]
period: [Q1, Q2, Q3, Q4, HALF-YEAR, NINE-MONTHS, FULL-YEAR, None]
type_of_content: [EARNINGS, OUTLOOK, OTHER]
For this I imagine a feature space where the context up to a given sentence is transformed into a vector (weighting possible headlines more heavily than the rest), the sentence itself is transformed into a vector, and three classifiers are then run on the combination.
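A rough sketch of that feature space, using weighted bag-of-words counts instead of real embeddings. The headline heuristic (a short line that is not a bullet), the weight of 3.0 and all function names are assumptions made for illustration only; any vectoriser and classifier could take their place.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def is_headline(line):
    # Hypothetical heuristic: short, non-bullet lines are treated as headlines.
    return not line.lstrip().startswith("-") and len(line.split()) <= 12


def context_features(previous_lines, sentence, headline_weight=3.0):
    """Combine weighted context counts and sentence counts in one feature dict."""
    feats = Counter()
    for line in previous_lines:
        weight = headline_weight if is_headline(line) else 1.0
        for tok in TOKEN_RE.findall(line.lower()):
            feats["ctx:" + tok] += weight
    for tok in TOKEN_RE.findall(sentence.lower()):
        feats["sent:" + tok] += 1.0
    return dict(feats)


previous = [
    "Second quarter - 1 August to 31 October 2019",
    "- Online sales increased 21% to 115 MSEK (95)",
]
feats = context_features(previous, "- Profit after tax totalled 91 MSEK (25)")
```

The same feature dict could then be fed to three separate classifiers, one each for year, period and type_of_content.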
For the period I could add MONTH_SPAN as well and then transform that into e.g. Q1, which is fine.
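The MONTH_SPAN to period step could look roughly like the sketch below. It assumes the first month of the fiscal year is known (May in the example above, inferred from the "Six months - 1 May to 31 October" headline) and that spans are written as "1 August to 31 October"; both function names are hypothetical.

```python
import re

MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}
SPAN_RE = re.compile(r"\d{1,2}\s+([A-Za-z]+)\s+to\s+\d{1,2}\s+([A-Za-z]+)")


def parse_month_span(text):
    """Return (start_month, end_month) as numbers 1-12, or None if not found."""
    m = SPAN_RE.search(text)
    if not m:
        return None
    start = MONTHS.get(m.group(1).lower())
    end = MONTHS.get(m.group(2).lower())
    if start is None or end is None:
        return None
    return start, end


def span_to_period(start_month, end_month, fiscal_year_start):
    """Map an inclusive month span to a period label, or None if it does not fit."""
    length = (end_month - start_month) % 12 + 1
    offset = (start_month - fiscal_year_start) % 12
    if length == 3 and offset % 3 == 0:
        return "Q" + str(offset // 3 + 1)
    if offset == 0:
        return {6: "HALF-YEAR", 9: "NINE-MONTHS", 12: "FULL-YEAR"}.get(length)
    return None


span = parse_month_span("Second quarter - 1 August to 31 October 2019")
print(span_to_period(*span, fiscal_year_start=5))  # Q2
```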
Labelling
Now finally to my question. Can I use prodigy effectively for this kind of task?
As I see it I have four labelling tasks
- ner labelling for finding CURRENT_AMOUNT given paragraphs.
- textcat (or just cat?) for period, year and type_of_content for each CURRENT_AMOUNT
My first challenge, though, is that the majority of the documents do not contain a CURRENT_AMOUNT, so it's tough starting from scratch. Instead I plan to label a large batch using patterns, which would have about 50% precision, pretrain a model on that, and then run ner.teach with that model to start collecting data. Would that work?
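As a sketch of the pattern side of this, the snippet below builds a small patterns file in the JSONL format Prodigy accepts (one label plus one spaCy token pattern per line). The unit list is an assumption based on the example text, and note that a token pattern like this labels the number together with its unit ("2,165 MSEK"), not the number alone, so the spans would still need trimming or correcting during annotation. It is also worth checking that the tokenizer keeps "2,165" as a single token.

```python
import json


def build_patterns(units, label="CURRENT_AMOUNT"):
    """One token pattern per unit: a number-like token followed by the unit."""
    return [
        {"label": label, "pattern": [{"LIKE_NUM": True}, {"LOWER": unit.lower()}]}
        for unit in units
    ]


def to_jsonl(patterns):
    """Serialise the patterns as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


patterns = build_patterns(["MSEK", "SEK"])
print(to_jsonl(patterns))
```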
For the classifier labelling I simply plan to classify each CURRENT_AMOUNT, which should be just fine.
Again, I'm sorry for the rather long story, but I hope it at least makes things clear. I'm looking forward to hearing any comments, as well as answers to my questions regarding the NER labelling strategy.