TextCat outcome depends on words that are not in the vocabulary

I am running the default spaCy TextCat, mainly on sentences that contain company names. I considered pretraining or something similar so that the model "understands" the company names before the TextCat training. As preparation, I looked at how the TextCat prediction depends on the specific company name by feeding the model the same sentence with only the company name (one word) changed. I found something strange: the prediction depends critically on the company name, even for names that do not appear in the training data. If I try different "nonsense" business names that are not in the vocabulary (en_core_web_lg), I get different outcomes. I think I am missing something crucial here. What is the "real input" for the TextCat, and why does it depend on random words that it does not know?

All of spaCy's models use hash embeddings (see: "Can you explain how exactly HashEmbed works?").

This embedding strategy avoids having to fix the vocabulary size at the start of training. There is no specific bound on the number of words the model can learn, and new words continue to influence training.
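To make the idea concrete, here is a minimal sketch of a hash embedding, not spaCy's actual implementation. The table size, vector width, number of seeds, and the use of `hashlib.md5` as the hash function are all assumptions chosen for illustration, and the random table stands in for trained weights. The point is that any string, seen or unseen, is hashed to a few rows of a fixed-size table, so every word gets some vector.

```python
import hashlib
import random

N_ROWS = 1000  # assumed table size, for illustration only
WIDTH = 4      # assumed vector width
N_SEEDS = 4    # assumed number of hash functions per word

rng = random.Random(0)
# Stand-in for a trained embedding table: one small vector per row.
TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(WIDTH)] for _ in range(N_ROWS)]


def row_ids(word):
    """Map a string to N_SEEDS row indices, one per seeded hash."""
    ids = []
    for seed in range(N_SEEDS):
        digest = hashlib.md5(f"{seed}:{word}".encode("utf8")).digest()
        ids.append(int.from_bytes(digest[:8], "little") % N_ROWS)
    return ids


def hash_embed(word):
    """Sum the table rows selected by the hashes of `word`."""
    vec = [0.0] * WIDTH
    for i in row_ids(word):
        for j in range(WIDTH):
            vec[j] += TABLE[i][j]
    return vec


# Two made-up names that no model has seen still get (different) vectors.
unseen_a = hash_embed("Zorblax")
unseen_b = hash_embed("Quimbify")
```

Because there is no lookup that can fail, there is no single "unknown word" bucket in this scheme: two different nonsense names will almost always land on different rows and therefore produce different vectors, which is consistent with the predictions changing when only the name changes.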

Incidentally, even without the hash embeddings most models would learn from unseen words: pretty much all models will have some vector for unknown words that's being updated. Most models also have subword features, which will cause the model to be updated from unknown words.

Hi Honnibal,

Thank you for your answer. Do I understand correctly that, not only in training but also when we use a previously trained model, unknown words are essentially mapped to random vectors? These vectors might by chance be strongly associated with a TextCat label. In practice, should I first remove all words in a sentence that are not in the vocabulary, to avoid the random effect of these associations? Does the same apply to words that have no vector assigned?



Hi Bart,

I wouldn't erase the unknown words, no. The context around each word is used by the CNN, so you might be messing up the structure by removing the unknown words. The model also uses subword features (first character, suffix, word shape) to help derive the vector for each word. Layer normalization is also used to prevent large magnitudes.
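As a rough illustration of the subword features mentioned above, here is a small sketch that computes a first character, a suffix, and a word shape for a token. It is an approximation written for this thread, not spaCy's own code: the suffix length of 3 and the rule that caps repeated shape characters at 4 are assumptions, and `subword_features` is a hypothetical helper name.

```python
def word_shape(text):
    """Approximate word shape: X for uppercase letters, x for other letters,
    d for digits, anything else kept as-is. Runs of the same shape
    character are capped at 4 (an assumed cap)."""
    out = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch
        if s == last:
            run += 1
        else:
            run = 1
            last = s
        if run <= 4:
            out.append(s)
    return "".join(out)


def subword_features(text):
    """Hypothetical helper: the three subword features named in the reply."""
    return {
        "prefix": text[:1],    # first character
        "suffix": text[-3:],   # last three characters (assumed length)
        "shape": word_shape(text),
    }


for name in ["Zorblax", "Quimbly", "Acme2000"]:
    print(name, subword_features(name))
```

Two unseen names such as "Zorblax" and "Quimbly" share the same shape here, so part of their representation is common even though neither word was in the training data, while the first character and suffix still differ.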

The model has been designed to deal with unknown words in a reasonable way, since this is something any NLP application has to handle. You can also do live updates of the model to teach it new words.