How to prepare text for spacy pretrain

I need to do some information extraction on earnings statements like this. For example, I’d want to extract that net sales were $494 million in that example.

It will be a mix of using spaCy matchers and exploiting the dependency parser (maybe NER as well at some point). I have a corpus of ~150k exchange statements, of which only a fraction (the earnings statements) is of interest to me, so I will train a classifier as well.

I noticed that you added spacy pretrain in the new spaCy 2.1 release, and that could probably be useful for my use case. But I am not sure how to prepare the input: should I just segment my HTML reports into paragraphs and save each paragraph to JSONL as raw text, or is there a better approach? Should I apply any filtering to the data used for pretraining?

And just to be sure: the purpose of pretrain is to output domain-specific word vectors (using transfer learning on pretrained vectors), right?

Splitting into paragraph-ish chunks seems reasonable, yes. For preprocessing, just make sure it's roughly the same as what you're doing during training and at runtime. So, don't do something like masking currency words as --CURRENCY-- if that's not how you're processing the text at runtime. The same goes for filtering, although spacy pretrain will skip past very long inputs for efficiency.
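As a rough illustration of the paragraph-splitting step, here is a minimal sketch using only the Python standard library. It assumes the reports mark paragraphs with <p> tags, which may not hold for your HTML, and the names ParagraphExtractor, html_to_paragraphs and write_jsonl are made up for this example. Each output line is a JSON object with a "text" key, which is the raw-text JSONL format spacy pretrain reads. Note that it collapses whitespace, so do the same at runtime if you use it.

```python
import json
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one paragraph (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None means we are currently outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <span>, ...) is kept as well.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace; apply the same step at runtime.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def close(self):
        super().close()
        self._flush()


def html_to_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def write_jsonl(paragraphs, path):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for text in paragraphs:
            f.write(json.dumps({"text": text}) + "\n")


if __name__ == "__main__":
    sample = "<html><body><p>Net sales were <b>$494 million</b>.</p></body></html>"
    write_jsonl(html_to_paragraphs(sample), "pretrain_texts.jsonl")
```

No filtering or masking is applied here on purpose, so the pretraining text stays close to what the pipeline will see later.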

Yes, domain-specific, but more to the point, context-sensitive. You can already train domain-specific vectors using word2vec or GloVe. The nice thing about spacy pretrain is that it gives you a context-sensitive representation too, so the same word can be represented differently depending on the context it appears in. You might find the reply here useful: "Fine-tuning mechanism and difference between pretrain and textcat?" (Issue #3759, explosion/spaCy on GitHub)