TextCat outcome depends on words that are not in the vocabulary

honnibal · February 1, 2020, 11:15am

All of spaCy's models use hash embeddings: Can you explain how exactly HashEmbed works ?

This embedding strategy is used to avoid having to initialize with a fixed-size vocabulary at the beginning of training. Instead, there is no specific bound on the number of words the model is able to learn. New words will continue to influence the training.

Incidentally, even without the hash embeddings most models would learn from unseen words: pretty much all models will have some vector for unknown words that's being updated. Most models also have subword features, which will cause the model to be updated from unknown words.

Topic		Replies	Views
Categorisation in foreign languages - are word vectors enough? textcat , spacy	2	630	May 3, 2019
Do the outputted models using textcat.batch-train make use of word vectors? usage , textcat , spacy	2	595	March 28, 2019
Out-of-vocabulary new NER model ner , spacy , solved	2	1276	September 15, 2018
Labeling & Training a Textcat with Contextual / Anchor Data usage , textcat	3	465	November 13, 2020
Textcat possible problem with uneven dataset? usage , textcat , done	2	956	January 17, 2020

TextCat outcome depends on words that are not in the vocabulary

Related topics