Why the choice of en_vectors_web_lg to categorize insults?

Hi Ines,

Why did you choose en_vectors_web_lg as your initial model? Were there advantages / disadvantages over en_core_web_lg?

Thanks in advance!

The en_vectors_web_lg model has 1.1 million unique vectors and en_core_web_lg only has 685 thousand.

For the initial task she was gathering similar terms, so having more vectors to choose from leads to better suggestions. I imagine that this is particularly useful for getting things like typo words, which may be pruned from the smaller 685 thousand vector model.
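
To illustrate the point, here's a minimal sketch (assuming a recent spaCy v2.x with the en_vectors_web_lg package installed; the seed word "insult" is just an example) of checking the vector table size and pulling nearest-neighbour suggestions from it:

```python
import numpy
import spacy

# Load the vectors-only package; en_core_web_lg could be swapped in to compare.
nlp = spacy.load("en_vectors_web_lg")

# The vector table shape is (number of unique vectors, dimensions).
print(nlp.vocab.vectors.shape)  # roughly (1100000, 300) for en_vectors_web_lg

# Query the table for the nearest neighbours of a seed term.
query = numpy.asarray([nlp.vocab["insult"].vector])
keys, _, scores = nlp.vocab.vectors.most_similar(query, n=10)
similar = [nlp.vocab.strings[int(key)] for key in keys[0]]
print(similar)
```

With more unique vectors in the table, queries like this surface more candidate terms, including rarer spellings that a smaller table may have pruned.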


Makes sense, thank you. So text categorization must only use word embeddings, not NER or the other parts of the en_core_web_lg model.

Sorry if this was a bit confusing! But yes, @justindujardin's answer is correct.

Yes, I was mostly trying to get across that it usually makes sense to use the same vectors for bootstrapping the seeds and as the basis for the text classifier later on. Aside from that, all model components like the entity recognizer, parser, tagger etc. are independent in spaCy v2.x and don't rely on each other – which is also why you can mix and match them.
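
As a rough sketch of that independence (assuming spaCy v2.x; the label here is a placeholder for this example), you can start from the vectors-only package and add just a text classifier on top, without a tagger, parser or entity recognizer in the pipeline at all:

```python
import spacy

# Start from the vectors-only model: its pipeline is empty.
nlp = spacy.load("en_vectors_web_lg")
print(nlp.pipe_names)  # []

# Add only a text classifier; it can use the word vectors as features.
textcat = nlp.create_pipe("textcat")
textcat.add_label("INSULT")  # placeholder label
nlp.add_pipe(textcat, last=True)
print(nlp.pipe_names)  # ['textcat']
```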