I am training spaCy's default TextCat, mostly on sentences containing company names. I considered pretraining (or something similar) so the model would "understand" the company names before the TextCat training. As preparation, I looked at how the TextCat prediction depends on the specific company name: I fed the model the same sentence with only the company name (one word) changed, and found something strange. The prediction depends critically on the company name, even for names that do not appear in the training data. If I try different "nonsense" business names that are not in the vocabulary (en_core_web_lg), I get different outcomes. I think I am missing something crucial here. What is the "real input" to the TextCat, and why does it depend on random words it does not know?
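For reference, the probing experiment described above can be sketched as a small harness that swaps only the company name in a fixed template and collects the classifier's scores. Here `predict` stands in for a trained spaCy pipeline (e.g. `lambda text: nlp(text).cats`); a dummy scorer is used so the sketch is self-contained, and the template and company names are made up.

```python
# Swap only the company name in a fixed template and compare the
# classifier's category scores for each variant.
TEMPLATE = "{} reported strong quarterly earnings today."

def probe(predict, names):
    """Return {company name: category scores} for the same template."""
    return {name: predict(TEMPLATE.format(name)) for name in names}

# Dummy scorer for illustration only; with a real pipeline you would
# pass something like: lambda text: nlp(text).cats
def dummy_predict(text):
    return {"FINANCE": round(len(text) % 10 / 10, 1)}

results = probe(dummy_predict, ["Acme", "Blorgofax", "Qzzmt"])
```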
All of spaCy's models use hash embeddings (see "Can you explain how exactly HashEmbed works?" for a more detailed discussion).
This embedding strategy avoids having to initialize with a fixed-size vocabulary at the start of training: there is no specific bound on the number of words the model can learn, and new words continue to influence training.
Incidentally, even without hash embeddings most models would learn from unseen words: pretty much every model keeps some vector for unknown words that gets updated, and most models also use subword features, which cause the model to be updated from unknown words as well.
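A toy sketch of the hashing idea (not spaCy's actual implementation, which combines multiple hashes over several token attributes): every string, known or not, is hashed into a fixed-size table, so an unseen company name still selects a real, trained row rather than a dedicated "unknown" slot.

```python
import zlib

ROWS = 16  # tiny table for illustration; real models use thousands of rows

def row_for(word: str) -> int:
    # Hash the word's bytes into the fixed-size table. Unknown words get
    # no special "UNK" entry: they land on whatever row their hash picks,
    # possibly sharing it with trained words (a collision).
    return zlib.crc32(word.encode("utf8")) % ROWS

# Two made-up company names map deterministically to some row each time.
rows = [row_for(w) for w in ("Blorgofax", "Qzzmt", "Blorgofax")]
```

This is why a "nonsense" name still changes the prediction: its hash deterministically selects rows whose weights were shaped by whatever training words collided there.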
Thank you for your answer. Do I understand correctly that, not only during training but also when using a previously trained model, unknown words are essentially mapped to random vectors? These vectors might, by chance, be strongly associated with a TextCat label. Should I then, in practice, first erase all words in a sentence that are not in the vocabulary, to avoid the random effect of these associations? Does the same apply to words that have no vector assigned?
I wouldn't erase the unknown words, no. The CNN uses the context around each word, so removing unknown words might mess up the sentence structure. The model also uses subword features (first character, suffix, word shape) to help derive the vector for each word, and layer normalization prevents large magnitudes.
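A rough approximation of those three subword features (spaCy's real `token.prefix_`, `token.suffix_`, and `token.shape_` attributes differ in detail, e.g. shape runs are truncated):

```python
def word_shape(word: str) -> str:
    # Crude version of spaCy's token.shape_: letters become X/x by case,
    # digits become d (real spaCy also truncates runs longer than four).
    chars = []
    for ch in word:
        if ch.isdigit():
            chars.append("d")
        elif ch.isalpha():
            chars.append("X" if ch.isupper() else "x")
        else:
            chars.append(ch)
    return "".join(chars)

def subword_features(word: str):
    # First character, three-character suffix, and shape: even a word
    # the model has never seen yields informative features.
    return word[0], word[-3:], word_shape(word)

features = subword_features("Blorgofax")
```

So a made-up name like "Blorgofax" is not a blank to the model: its capitalized shape and suffix already carry signal.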
The model has been designed to deal with unknown words in a reasonable way, as this is something any NLP application has to handle. You can also update the model live to teach it new words.