Predicting out-of-vocabulary entities with a new NER model

I want to train a new NER model on my annotated data by following the example. If I’m using a pre-trained model, for example ‘en_core_web_lg’, how can I predict out-of-vocabulary entities? I can provide a lot of training data (thousands of examples) with various animals (e.g. cat, dog, etc.).

Isn’t this pre-trained model using word vectors, so it could potentially identify entities similar to the ones in the training data? I’m a bit puzzled.
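For intuition on the word-vector point: if two words have similar vectors, the model sees similar features for them, so training on one helps it recognize the other. A toy sketch with made-up vectors (illustrative values only, not the real en_core_web_lg vectors):

```python
import numpy as np

# Stand-in word vectors: "cat" and "dog" point in similar directions,
# "sell" does not. Real pretrained vectors behave the same way at scale.
vectors = {
    "cat":  np.array([0.9, 0.1, 0.0]),
    "dog":  np.array([0.8, 0.2, 0.1]),
    "sell": np.array([0.0, 0.1, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))   # high: similar words
print(cosine(vectors["cat"], vectors["sell"]))  # low: unrelated words
```

Because the NER model's features are built on top of such vectors, annotating lots of "cat" and "dog" examples nudges it toward tagging other animal words it has never seen annotated.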

We’re able to learn new vocabulary items without resizing the embedding table. This is one of the big advantages of the hash embeddings used in spaCy. I explain it here: Can you explain how exactly HashEmbed works?
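A minimal sketch of the hashing trick behind this (toy table sizes and plain MD5 hashing here for illustration; spaCy's actual HashEmbed uses MurmurHash and the table rows are learned during training):

```python
import hashlib

import numpy as np

N_ROWS, DIM, SEEDS = 1000, 4, (1, 2, 3)  # toy sizes; real tables are larger

def bucket(word: str, seed: int) -> int:
    """Deterministically map (seed, word) to a row of the fixed-size table."""
    digest = hashlib.md5(f"{seed}:{word}".encode()).hexdigest()
    return int(digest, 16) % N_ROWS

def hash_embed(word: str, table: np.ndarray) -> np.ndarray:
    """Sum one table row per seed: every word, seen or not, gets a vector."""
    return sum(table[bucket(word, seed)] for seed in SEEDS)

rng = np.random.default_rng(0)
table = rng.standard_normal((N_ROWS, DIM))  # fixed size, never resized

print(hash_embed("cat", table))    # a word from the training data
print(hash_embed("zebra", table))  # an out-of-vocabulary word works the same way
```

Because the table never grows, an unseen word simply hashes to some existing rows; using several seeds means two words rarely collide on all of their rows, so their summed vectors stay distinguishable and the model can still learn word-specific behaviour.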

Thanks Matt, I will try it.