I have a task where I need to do structured information extraction on exchange statements. So far I've been using a rule-based approach built on spaCy's Matcher together with some sentence logic.
However, I'd prefer migrating to an ML approach where I teach a model to attend to the tokens of interest, but I'm not sure if that's possible in spaCy. I love the spaCy framework, so I'd prefer to stay with it.
Given an exchange statement like this, I'd like to extract something like OUTLOOK-EBIT_MARGIN, which would be "around the same level as last year".
I imagine generating a list of candidate n-grams with hundreds of features each, training a classifier on them, and extracting the highest-scoring n-gram. As features I'd have x-y coordinates, bold, italic, etc. The text itself is not enough, since the information could be found in tables as well.
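To make the idea concrete, here is a minimal sketch of the candidate-ranking step in plain Python. It is not spaCy code. Token, Candidate, ngram_candidates, best_candidate and toy_score are all hypothetical names, the five features stand in for the hundreds I'd really use, and toy_score is a hand-written placeholder for a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class Token:
    """A token plus layout information (hypothetical structure)."""
    text: str
    x: float          # horizontal position on the page
    y: float          # vertical position on the page
    bold: bool = False
    italic: bool = False


@dataclass
class Candidate:
    start: int            # index of the first token (inclusive)
    end: int              # index after the last token (exclusive)
    text: str
    features: List[float]


def ngram_candidates(tokens: List[Token], max_n: int = 7) -> Iterator[Candidate]:
    """Yield every n-gram of 1..max_n tokens with a small feature vector."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_n, len(tokens)) + 1):
            span = tokens[start:end]
            features = [
                float(len(span)),                        # n-gram length
                span[0].x,                               # position of first token
                span[0].y,
                sum(t.bold for t in span) / len(span),   # fraction of bold tokens
                sum(t.italic for t in span) / len(span), # fraction of italic tokens
            ]
            text = " ".join(t.text for t in span)
            yield Candidate(start, end, text, features)


def best_candidate(tokens: List[Token],
                   score: Callable[[List[float]], float],
                   max_n: int = 7) -> Candidate:
    """Return the highest-scoring n-gram under the given scoring function."""
    return max(ngram_candidates(tokens, max_n), key=lambda c: score(c.features))


def toy_score(features: List[float]) -> float:
    """Placeholder for a trained classifier: prefers long, non-bold spans."""
    length, _x, _y, bold_fraction, _italic_fraction = features
    return length - 5.0 * bold_fraction


# Made-up example: a bold header followed by the value I want to extract.
tokens = [
    Token("EBIT", 10.0, 10.0, bold=True),
    Token("margin", 40.0, 10.0, bold=True),
    Token("around", 10.0, 20.0),
    Token("the", 50.0, 20.0),
    Token("same", 70.0, 20.0),
    Token("level", 100.0, 20.0),
    Token("as", 130.0, 20.0),
    Token("last", 145.0, 20.0),
    Token("year", 170.0, 20.0),
]

best = best_candidate(tokens, toy_score)
print(best.text)  # around the same level as last year
```

In a real setup the scoring function would be a classifier trained on labelled statements, with one such ranking per field (for example OUTLOOK-EBIT_MARGIN).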
Let me know if this is something that can be done in spaCy or if you have some comments. Thank you.
I should add that the exchange statements can be very different and might not have nice headers like this one.