I am facing a challenge where I need to extract information from earnings statements like this. E.g. I’d extract that net sales were $494 million in that example.
It will be a mix of spaCy matchers and the dependency parser (maybe NER as well at some point). My dataset is ~150k exchange statements, of which only a fraction (the earnings statements) is of interest to me, so I will train a classifier as well.
I noticed that you added `spacy pretrain` in the new spaCy 2.1 release, and that could probably be useful for my use case. But I am not sure how to prepare the data: should I just segment my HTML reports into paragraphs and save each paragraph to JSONL as raw text, or is there a better way? And should I filter the data before using it for training?
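To make the question concrete, here is a minimal sketch of the paragraph approach I have in mind, using only the Python standard library. It assumes that each JSONL line should be an object with a `"text"` key, which is my reading of the pretrain docs. The names `ParagraphExtractor` and `html_to_jsonl_lines` and the `min_chars` length filter are my own placeholders, not anything from spaCy.

```python
import json
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def html_to_jsonl_lines(html, min_chars=20):
    """Return one JSON string per paragraph, as {"text": ...}.

    min_chars is a hypothetical filter that drops very short fragments;
    whether any filtering helps is exactly what I am unsure about.
    """
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return [
        json.dumps({"text": p})
        for p in parser.paragraphs
        if len(p) >= min_chars
    ]
```

I would then write the returned lines to a single `.jsonl` file, one per line, for all ~150k statements.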
And just to be sure: the purpose of pretrain is to output domain-specific word vectors (using transfer learning on pretrained vectors), right?