I may have missed it, but I couldn't find much information about the tokenization method, vocabulary construction (e.g. its size), or embedding training (e.g. the embedding dimensionality and which LM objective is used) when using a `blank:en` spaCy model. I'm hoping to get better results on my textcat task, since my corpus is quite different from the ones the pretrained models (embeddings) were trained on.
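For context, my setup is roughly the following (a minimal sketch assuming spaCy v3's string-based `add_pipe` API; the label names are just placeholders):

```python
import spacy

# Start from a blank English pipeline: tokenizer + empty vocab,
# no pretrained vectors and no trained components
nlp = spacy.blank("en")

# Add a text classifier and register placeholder labels
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
```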
I'm assuming the default spaCy tokenizer is used, as detailed here: How Tokenizer Works.
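This is easy to sanity-check on a blank pipeline (a sketch; the sample sentence is arbitrary):

```python
import spacy

nlp = spacy.blank("en")

# The default rule-based tokenizer splits on whitespace and then
# applies punctuation rules and language-specific exceptions
doc = nlp("Hello, world!")
print([t.text for t in doc])  # → ['Hello', ',', 'world', '!']
```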
- No information is given about the size or construction method of the vocabulary (is it just the top X most frequent tokens in the corpus?).
- What are the dimensions of the embedding vectors?
- How are they trained on the given corpus? What is the LM loss function?
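For what it's worth, inspecting a freshly created blank pipeline suggests it ships with no static vectors at all (a sketch; I'm assuming `nlp.vocab.vectors` is the table that `Token.vector` would read from):

```python
import spacy

nlp = spacy.blank("en")

# A blank model has no static vectors: zero rows and zero width
# until vectors are trained or loaded
print(nlp.vocab.vectors.shape)

# The vocab's lexeme cache grows lazily as text is processed,
# so there is no fixed top-X vocabulary up front
nlp("some text")
```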
I opened an issue here too (https://github.com/explosion/spaCy/issues/6290) with the same write-up.