I previously posted about information extraction from exchange statements. So far it has been successful using the spaCy matchers, but now I want to add some more ML since the current model has its limitations. I want to be thorough in my explanation, so apologies for the long post.
Input
My input data could look something like this (this example has headlines and bullet points, which is not always the case!):
Second quarter - 1 August to 31 October 2019
- Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
- Online sales increased 21% to 115 MSEK (95)
- Operating profit, including non-recurring costs and costs linked to the CO100+ action programme, totalled 133 MSEK (33). Excluding the effect of IFRS 16, operating profit amounted to 107 MSEK
- Underlying operating profit amounted to 153 MSEK (124) (excl IFRS 16)
- The operating margin was 6.1% (1.6)
- Net debt/EBITDA excluding the effect of IFRS 16 (12 months) amounted to 0.6 times (0.5)
- Profit after tax totalled 91 MSEK (25)
- Earnings per share amounted to 1.43 SEK (0.40)
- A partnership with Kolonial.no, Norway’s largest online food retailer, started in September
Six months - 1 May to 31 October 2019
- Sales in the Nordics increased by 4% to 4,157 MSEK (3,978), organic growth up 4% and total sales increased 2% to 4,209 MSEK (4,115), organic growth up 2%
- Operating profit, including non-recurring items and costs linked to CO100+ action programme, totalled 212 MSEK (65). Excluding the effect of IFRS 16, operating profit amounted to 159 MSEK
- Underlying operating profit amounted to 244 MSEK (186) (excl IFRS 16)
- The operating margin was 5.0% (1.6)
- Profit after tax totalled 139 MSEK (55)
- Earnings per share amounted to 2.21 SEK (0.87)
Output
From this I want to extract several metrics, e.g. an EARNINGS payload:
{'period': 'Q2', 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': 'CURRENT'}
The information is found in this line
Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
plus the context of that sentence. Note that Sales in the Nordics is not of interest.
Proposed model(s)
My idea is to have two types of models
- Local text models that run on smaller paragraphs.
- Context classifiers that determine the context.
Local text models
The input for the local text models would look like the following
Sales in the Nordics increased by 3% to 2,145 MSEK (2,089), organic growth up 3% and total sales were unchanged at 2,165 MSEK (2,157), organic growth up 1%
which should produce an output like
{'period': None, 'metric': 'SALES', 'amount': 2165000, 'currency': 'SEK', 'year': None}
This works pretty well using the matchers, especially for determining the currency, multiplier and metric. But to determine the correct amount, I imagine training a NER model that locates CURRENT_AMOUNT (current, since it should ignore previous years' numbers when they are mentioned), in this case 2,165.
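To make the current/previous distinction concrete, here is a minimal rule-based sketch of what the NER model would have to improve on. It assumes that an amount is a number directly followed by a unit (only MSEK and SEK here) and that the previous year's figure is the parenthesised number right after it. The names AMOUNT_RE and find_amounts are hypothetical and not part of my current matcher setup.

```python
import re

# Assumption: "<number> <unit>" is a current amount, and an optional
# "(<number>)" immediately after it is the previous year's figure.
AMOUNT_RE = re.compile(
    r"(?P<current>\d[\d,]*(?:\.\d+)?)\s*(?P<unit>MSEK|SEK)"
    r"(?:\s*\((?P<previous>\d[\d,]*(?:\.\d+)?)\))?"
)


def find_amounts(text):
    """Return every unit-bearing amount with its optional comparison figure."""
    results = []
    for match in AMOUNT_RE.finditer(text):
        results.append(
            {
                "current": match.group("current"),
                "previous": match.group("previous"),  # None if not present
                "unit": match.group("unit"),
                "span": match.span("current"),  # character offsets in text
            }
        )
    return results
```

On the example sentence this finds both 2,145 and 2,165 as current amounts, so a rule like this cannot tell that only the total sales figure is of interest. That is the part I hope a trained model can learn from context. The character offsets could be turned into spaCy spans with doc.char_span if needed.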
Context classifiers
The purpose of the context classifiers is to fill in the blanks for each CURRENT_AMOUNT token/span, i.e. determine
year: [PREVIOUS, CURRENT, NEXT]
period: [Q1, Q2, Q3, Q4, HALF-YEAR, NINE-MONTHS, FULL-YEAR, None]
type_of_content: [EARNINGS, OUTLOOK, OTHER]
For this I imagine a feature space where the context up to a sentence is transformed into a vector (weighting possible headlines more than the rest), the sentence itself is transformed into a second vector, and three classifiers are then run on the combined features.
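A minimal sketch of that feature space, under several assumptions of mine: a hashed bag-of-words stands in for whatever vectoriser is eventually used, a line counts as a headline if it does not start with a bullet, and the names DIM, featurize and headline_weight are hypothetical.

```python
import re
import zlib

DIM = 64  # hypothetical size of each hashed bag-of-words vector


def _add_tokens(text, weight, vec):
    """Add weighted token counts to vec using a stable hash (crc32)."""
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += weight


def featurize(context_lines, sentence, headline_weight=3.0):
    """Concatenate a headline-weighted context vector and a sentence vector."""
    context_vec = [0.0] * DIM
    for line in context_lines:
        # Assumption: lines that do not start with a bullet are headlines.
        is_headline = not line.lstrip().startswith("-")
        _add_tokens(line, headline_weight if is_headline else 1.0, context_vec)
    sentence_vec = [0.0] * DIM
    _add_tokens(sentence, 1.0, sentence_vec)
    return context_vec + sentence_vec
```

The resulting vector (length 2 * DIM) would be the shared input to the three classifiers for year, period and type_of_content. A headline such as "Second quarter - 1 August to 31 October 2019" then contributes three times as much as a bullet line in the same context.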
For the period I could add MONTH_SPAN as well and then transform that into e.g. Q1, which is fine.
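The transformation from a month span to a period could be sketched like this. It assumes the first month of the fiscal year is known; in the example above it appears to be May, since 1 May to 31 October is reported as six months and 1 August to 31 October as the second quarter. The function name period_from_month_span and its parameters are hypothetical.

```python
def period_from_month_span(start_month, end_month, fiscal_start_month=1):
    """Map an inclusive month span (1-12) to a period label, or None."""
    # Number of months covered, allowing spans that wrap past December.
    length = (end_month - start_month) % 12 + 1
    # Months between the fiscal year start and the start of the span.
    offset = (start_month - fiscal_start_month) % 12
    if length == 3 and offset % 3 == 0:
        return f"Q{offset // 3 + 1}"
    if offset == 0:
        # Cumulative periods always start at the beginning of the fiscal year.
        return {6: "HALF-YEAR", 9: "NINE-MONTHS", 12: "FULL-YEAR"}.get(length)
    return None


print(period_from_month_span(8, 10, fiscal_start_month=5))  # Q2
print(period_from_month_span(5, 10, fiscal_start_month=5))  # HALF-YEAR
```

Spans that do not line up with the fiscal calendar return None, which matches the None option in the period label set.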
Labelling
Now, finally, to my question: can I use Prodigy effectively for this kind of task?
As I see it I have four labelling tasks:
- ner labelling for finding CURRENT_AMOUNT given paragraphs.
- textcat (or just cat?) for period, year and type_of_content for each CURRENT_AMOUNT.
My first challenge, though, is that the majority of the documents do not contain a CURRENT_AMOUNT, so it's tough starting from scratch. Instead, I plan to label a large batch using patterns, which would have 50% precision, pretrain a model, and then run ner.teach with that model and thereby start collecting data. Would that work?
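To illustrate the bootstrapping step, here is a rough sketch of how I would pre-label paragraphs with a pattern and write them out as JSONL. It assumes that Prodigy tasks are JSON objects with a "text" key and a "spans" list of "start"/"end"/"label" character offsets, which is my understanding of the format. The regex and the names CANDIDATE_RE, to_task and write_jsonl are hypothetical stand-ins for my real patterns.

```python
import json
import re

# Hypothetical low-precision rule: any number directly followed by a unit.
# Parenthesised comparison figures carry no unit, so they are not matched.
CANDIDATE_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:MSEK|SEK)\b")


def to_task(text, label="CURRENT_AMOUNT"):
    """Build one pre-annotated task with character-offset spans."""
    spans = [
        {"start": m.start(1), "end": m.end(1), "label": label}
        for m in CANDIDATE_RE.finditer(text)
    ]
    return {"text": text, "spans": spans}


def write_jsonl(paragraphs, path):
    """Write one task per line so the file can be loaded as a source."""
    with open(path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            f.write(json.dumps(to_task(paragraph), ensure_ascii=False) + "\n")
```

Paragraphs without any match end up with an empty spans list, which would also give the model the negative examples that make up most of my documents.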
For the classifier labelling I simply plan to classify each CURRENT_AMOUNT, which should be just fine.
Again, I'm sorry for the rather long post, but I hope it at least makes things clear. I look forward to any comments, as well as answers to my questions about the NER labelling strategy.