Why the choice of en_vectors_web_lg to categorize insults?

Hi Ines,

Why did you choose en_vectors_web_lg as your initial model? Were there advantages / disadvantages over en_core_web_lg?

Thanks in advance!

The en_vectors_web_lg model has 1.1 million unique vectors and en_core_web_lg only has 685 thousand.

For the initial task she was gathering similar terms, so having more vectors to choose from leads to better suggestions. I imagine that this is particularly useful for getting things like typo words, which may be pruned from the smaller 685 thousand vector model.
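
To illustrate the point, here's a minimal sketch (assuming a recent spaCy v2.x with the en_vectors_web_lg package installed; the seed word "insult" is just an example) of checking the vector table size and pulling nearest-neighbour suggestions from it:

```python
import numpy
import spacy

# Load the vectors-only package; en_core_web_lg could be swapped in to compare.
nlp = spacy.load("en_vectors_web_lg")

# The vector table shape is (number of unique vectors, dimensions).
print(nlp.vocab.vectors.shape)  # roughly (1100000, 300) for en_vectors_web_lg

# Query the table for the nearest neighbours of a seed term.
query = numpy.asarray([nlp.vocab["insult"].vector])
keys, _, scores = nlp.vocab.vectors.most_similar(query, n=10)
similar = [nlp.vocab.strings[int(key)] for key in keys[0]]
print(similar)
```

With more unique vectors in the table, queries like this surface more candidate terms, including rarer spellings that a smaller table may have pruned.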


Makes sense, thank you. So text categorization must only use word embeddings, not NER or the other parts of the en_core_web_lg model.

Sorry if this was a bit confusing! But yes, @justindujardin's answer is correct.

Yes, I was mostly trying to get across that it usually makes sense to use the same vectors for bootstrapping the seeds and as the basis for the text classifier later on. Aside from that, all model components like the entity recognizer, parser, tagger etc. are independent in spaCy v2.x and don't rely on each other – which is also why you can mix and match them.
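
As a rough sketch of that independence (assuming spaCy v2.x; the label here is a placeholder for this example), you can start from the vectors-only package and add just a text classifier on top, without a tagger, parser or entity recognizer in the pipeline at all:

```python
import spacy

# Start from the vectors-only model: its pipeline is empty.
nlp = spacy.load("en_vectors_web_lg")
print(nlp.pipe_names)  # []

# Add only a text classifier; it can use the word vectors as features.
textcat = nlp.create_pipe("textcat")
textcat.add_label("INSULT")  # placeholder label
nlp.add_pipe(textcat, last=True)
print(nlp.pipe_names)  # ['textcat']
```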