How to prepare text for spacy pretrain

nix411 · June 27, 2019, 8:00pm

I need to do some information extraction on earnings statements like this. For example, I’d extract that net sales were $494 million in that example.

It will be a mix of using spaCy matchers and exploiting the dependency parser (maybe NER as well at some point). I have about 150k exchange statements, of which only a fraction (the earnings statements) are of interest to me, so I will train a classifier as well.

I noticed that you added spacy pretrain in the new spaCy 2.1 release, and it could be useful for my use case. But I am not sure how to prepare the data: should I just segment my HTML reports into paragraphs and save each paragraph to JSONL as raw text? And should I apply any filtering to the data used for pretraining?

And just to be sure: the purpose of pretrain is to output domain-specific word vectors (using transfer learning on pretrained vectors), right?

honnibal · July 1, 2019, 9:07pm

Splitting into paragraph-ish chunks seems reasonable, yes. For preprocessing, just make sure it's roughly the same as what you're doing during training and runtime. So, don't do something like masking currency words as --CURRENCY-- if that's not how you're processing the text at runtime. The same goes for filtering, although pretrain will skip past very long inputs for efficiency.
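As a rough illustration of the paragraph-splitting step, here is a minimal sketch using only the Python standard library. It assumes the reports are reasonably well-formed HTML and that one JSON object per line with a "text" key is the input format pretrain reads. The names `ParagraphExtractor`, `html_to_paragraphs` and `to_jsonl_lines`, and the choice of which tags count as paragraph breaks, are hypothetical and would need adjusting to the real reports.

```python
import json
from html.parser import HTMLParser

# Assumption: these tags mark paragraph-ish boundaries in the reports.
BLOCK_TAGS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
# Text inside these tags is never part of the visible report.
SKIP_TAGS = {"script", "style"}


class ParagraphExtractor(HTMLParser):
    """Collects visible text and splits it at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = []
        self._skip_depth = 0

    def flush(self):
        # Collapse runs of whitespace, keep the text otherwise untouched.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def html_to_paragraphs(html):
    """Return the non-empty paragraph-ish chunks of an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last block tag
    return parser.paragraphs


def to_jsonl_lines(paragraphs):
    """Serialize each paragraph as one JSON object with a "text" key."""
    return [json.dumps({"text": p}, ensure_ascii=False) for p in paragraphs]


if __name__ == "__main__":
    sample = "<body><p>Net sales were $494 million.</p><p>Outlook is unchanged.</p></body>"
    for line in to_jsonl_lines(html_to_paragraphs(sample)):
        print(line)
```

Writing those lines to a file gives something that could be passed as the texts argument of `python -m spacy pretrain`. Note that the sketch only normalizes whitespace and does no masking or filtering, in line with the advice to keep the text the same as what the model sees at runtime.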

Yes, domain-specific -- but more to the point, context-sensitive. You can train domain-specific vectors using word2vec or GloVe. The nice thing about spacy pretrain is that it gives you a context-sensitive representation too, so the same word can be represented differently depending on the context it appears in. You might find the reply here useful: Fine-tuning mechanism and difference between pretrain and textcat? · Issue #3759 · explosion/spaCy · GitHub
