Label scheme for HTML tables with quite a few metrics to extract

nix411 · May 7, 2019, 7:39pm

Imagine a table like this:

                        Q4 2018 Q4 2017
SEK b.                                 
Net sales                  63.8    57.9
Gross margin              25.7%   21.6%
Operating income (loss)    -1.9   -19.3

For a table like that I’d extract Net sales as NET_SALES and Operating income (loss) as EBIT. I’d also note that the currency is SEK and the multiplier is BILLION. Can Prodigy help with all these annotations?

[
    {
        'metric': 'NET_SALES',
        'period': 'Q4',
        'amount': '63.8',
        'currency': 'SEK',
        'multiplier': 'BILLION',
    },
    {
        'metric': 'EBIT',
        'period': 'Q4',
        'amount': '-1.9',
        'currency': 'SEK',
        'multiplier': 'BILLION',
    },
]

honnibal · May 8, 2019, 12:32am

If you know that all your data is tabular, and you just want to annotate the text contents, you should probably develop a custom recipe. You should also probably not try to train a model using the standard components like the text classifier or the named entity recognizer, as your data isn’t primarily textual, and you’ll probably be better off using other approaches.

If you’ve got a lot of these tables to annotate, you probably want to check just how many total field names you have. I expect you can probably assume that if you’ve mapped Operating income (loss) to EBIT once, that’s always going to be the name. I’m sure you’re not going to hit one exceptional table where Operating income (loss) instead needs to be annotated NET_SALES.

Even if you have hundreds of thousands of these tables to annotate, if you extract the unique text field names, you might find you only have like 3000 of them. If so, you probably want to do context-independent annotations, and just hit “ignore” if you do hit a case where it’s unclear.
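A minimal sketch of that idea, assuming the tables are already parsed into lists of row labels: count the unique labels across all tables, then turn each one into a single multiple-choice task. The function names and the label list are hypothetical, and the task dicts only follow the "text" plus "options" shape used by Prodigy's choice interface; nothing here calls Prodigy itself.

```python
from collections import Counter


def count_row_labels(tables):
    """Count how often each row label occurs across all tables.

    `tables` is assumed to be an iterable of lists of row-label strings.
    Whitespace is collapsed so "Net  sales " and "Net sales" count as one.
    """
    counts = Counter()
    for row_labels in tables:
        for label in row_labels:
            counts[" ".join(label.split())] += 1
    return counts


def build_choice_tasks(counts, metric_labels):
    """Build one context-independent annotation task per unique row label."""
    options = [{"id": name, "text": name} for name in metric_labels]
    return [
        {"text": label, "meta": {"count": n}, "options": options}
        for label, n in counts.most_common()  # most frequent labels first
    ]


if __name__ == "__main__":
    tables = [
        ["Net sales", "Gross margin", "Operating income (loss)"],
        ["Net  sales ", "Operating income (loss)"],
    ]
    for task in build_choice_tasks(count_row_labels(tables), ["NET_SALES", "EBIT"]):
        print(task["meta"]["count"], task["text"])
```

Sorting by frequency means the first few hundred decisions cover most of the rows, and rare or ambiguous labels can simply be ignored.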

Finally, I wouldn’t worry about stuff like extracting SEK as currency and billions as the denomination in the annotations. You should just have a rule-based process for that.
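As a rough illustration of what such a rule could look like, here is a sketch that parses a unit cell like "SEK b." into a currency and a multiplier. The abbreviation table, the currency list, and the function name are assumptions made for the example; real tables will need more variants.

```python
import re

# Hypothetical lookup tables; extend them to match your own data.
MULTIPLIERS = {
    "b": "BILLION", "bn": "BILLION", "billion": "BILLION",
    "m": "MILLION", "mn": "MILLION", "million": "MILLION",
    "k": "THOUSAND", "thousand": "THOUSAND",
}
CURRENCIES = {"SEK", "USD", "EUR"}

# Three uppercase letters, an optional space, a multiplier word, an optional dot.
UNIT_RE = re.compile(r"^\s*([A-Z]{3})\s*([A-Za-z]+)\.?\s*$")


def parse_unit_label(text):
    """Return {"currency", "multiplier"} for a unit cell, or None if no rule fits."""
    match = UNIT_RE.match(text)
    if match is None:
        return None
    currency, abbreviation = match.groups()
    multiplier = MULTIPLIERS.get(abbreviation.lower())
    if currency not in CURRENCIES or multiplier is None:
        return None
    return {"currency": currency, "multiplier": multiplier}


if __name__ == "__main__":
    print(parse_unit_label("SEK b."))
    print(parse_unit_label("Net sales"))
```

Returning None for anything unrecognized keeps the rule conservative: unknown unit cells surface as gaps to review rather than as wrong values.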

nix411 · May 9, 2019, 11:30am

Yeah so the question is actually only on the labeling scheme and not on the modelling. I have a model already but I want to write heaps of unit tests to make sure that I’m not breaking my model when I make changes.

I was thinking that Prodigy might be useful for collecting the correct answers to write these unit tests against. The tests should cover all fields: currency, multiplier, metric type, amount, etc.

And yes, a custom recipe for sure. I guess I’d run the recipe once for each type of field I want to label.
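To make the unit-test idea concrete, here is a sketch of how annotated answers could be compared against model output, field by field. It assumes both sides use the record layout from the first post and that (metric, period) identifies a record; the helper names are hypothetical, and the predicted records below are made up to stand in for a real model.

```python
FIELDS = ("amount", "currency", "multiplier")


def index_records(records):
    """Key each record by (metric, period) so both sides can be aligned."""
    return {(r["metric"], r["period"]): r for r in records}


def find_mismatches(predicted, gold, fields=FIELDS):
    """Return (key, field, expected, actual) for every disagreement with gold."""
    predicted_by_key = index_records(predicted)
    mismatches = []
    for key, expected in index_records(gold).items():
        actual = predicted_by_key.get(key)
        if actual is None:
            mismatches.append((key, "missing", None, None))
            continue
        for field in fields:
            if actual.get(field) != expected.get(field):
                mismatches.append((key, field, expected.get(field), actual.get(field)))
    return mismatches


if __name__ == "__main__":
    gold = [
        {"metric": "NET_SALES", "period": "Q4", "amount": "63.8",
         "currency": "SEK", "multiplier": "BILLION"},
        {"metric": "EBIT", "period": "Q4", "amount": "-1.9",
         "currency": "SEK", "multiplier": "BILLION"},
    ]
    # Stand-in for model output: the EBIT amount has lost its sign.
    predicted = [dict(gold[0]), dict(gold[1], amount="1.9")]
    for mismatch in find_mismatches(predicted, gold):
        print(mismatch)
```

In a test suite, each annotated table becomes one case that asserts `find_mismatches(...)` returns an empty list, so a failure reports exactly which field of which metric regressed.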
