Vocabulary construction & embedding training information of blank:en LM

atakanokan · October 22, 2020, 7:01pm

Hi,

I may have missed it, but I couldn't find much information about the tokenization method, vocabulary construction (e.g. vocabulary size), or embedding training (e.g. embedding size, and which LM training objective is used to train the embeddings) for a blank:en spaCy model. I'm hoping to get better results on my textcat task, whose corpus is very different from the ones the pretrained models (embeddings) were trained on.

Tokenization
I am assuming the default spaCy tokenizer is used, as detailed here: How Tokenizer Works.

Vocabulary Construction
No information is given about the size of the vocabulary or how it is constructed (is it just the top X most frequent tokens in the corpus?).
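
To make the question concrete, this is roughly what I mean by a "top X most frequent tokens" vocabulary. It is only a minimal sketch of that generic scheme and is not taken from spaCy; I don't know whether a blank:en model does anything like this. The names build_vocab, encode and the "<unk>" placeholder are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size, unk_token="<unk>"):
    """Map the `max_size` most frequent tokens to integer ids.

    Id 0 is reserved for unknown (out-of-vocabulary) tokens.
    This is an illustrative scheme, not spaCy's behavior.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {unk_token: 0}
    for token, _ in counts.most_common(max_size):
        if token not in vocab:  # never overwrite the reserved unknown id
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="<unk>"):
    """Replace each token with its id, falling back to the unknown id."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

# Toy corpus, already tokenized (hypothetical data).
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(docs, max_size=3)
ids = encode(["the", "dog", "ran"], vocab)
```

With a scheme like this, every token outside the top X collapses to a single unknown id, which is why the vocabulary size matters for a corpus that differs a lot from the pretraining data.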

Embeddings

What are the dimensions of the embedding vectors?
How are they trained on the given corpus? What is the LM loss function?

Which page or section is this issue related to?

I also opened an issue here (https://github.com/explosion/spaCy/issues/6290) with the same write-up.

adriane · October 23, 2020, 8:57am

Let's keep this discussion on the spaCy issue tracker: https://github.com/explosion/spaCy/issues/6290
