Vocabulary construction & embedding training for the blank:en LM


I may have missed it, but I couldn't find much information about the tokenization method, the vocabulary construction (e.g. vocabulary size), or the embedding training (e.g. embedding size, and which LM training objective is used to train the embeddings) when using a blank:en spaCy model. I'm hoping to achieve better results on my textcat task, whose corpus is quite different from the one the pretrained models (embeddings) were trained on.
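For reference, this is roughly how I'm setting things up (assuming the spaCy v3 API; the labels are just placeholders for my own):

```python
import spacy

# Start from a blank English pipeline — no pretrained vectors or components
nlp = spacy.blank("en")

# Add a text classifier (v3 string-name API) and register some labels
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

print(nlp.pipe_names)  # ["textcat"]
```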

I am assuming the default spacy tokenizer is used as detailed here: How Tokenizer Works.
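A quick check on a blank pipeline does show the rule-based tokenizer behavior described on that page (the example sentence is my own):

```python
import spacy

nlp = spacy.blank("en")  # uses the default rule-based English tokenizer
doc = nlp("Don't tokenize me, bro!")
print([t.text for t in doc])
# ['Do', "n't", 'tokenize', 'me', ',', 'bro', '!']
```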

Vocabulary Construction
I couldn't find any information about the size of the vocabulary or how it is constructed (is it just the top-X most frequent tokens in the corpus?).


Embedding Training

  • What are the dimensions of the embedding vectors?
  • How are they trained on the given corpus? What is the LM loss function?
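For what it's worth, inspecting a freshly created blank pipeline suggests it ships with no static vectors at all (so these questions presumably only apply once vectors are trained or loaded):

```python
import spacy

nlp = spacy.blank("en")

# The static vectors table is empty in a blank model
print(nlp.vocab.vectors.shape)  # 0 rows, i.e. no pretrained vectors
```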

Which page or section is this issue related to?

I opened an issue here too (https://github.com/explosion/spaCy/issues/6290) with the same write-up.

Let's keep this discussion on the spacy issue tracker: https://github.com/explosion/spaCy/issues/6290
