What parameters are optimized during NER model training?


I have been working on a spaCy NER model for a few days. First of all, I would like to thank you: your tool is very helpful. I would like to better understand the training stage.

I had a look at your video [https://www.youtube.com/watch?v=sqDHBH9IjRU] explaining the NER model. I have understood that there are 3 parameter layers (please correct me if I am wrong):

  • the ones for the Bloom embedding;
  • the ones for the contextual embedding by means of a CNN;
  • the ones for the neural network in charge of the prediction.

If so, does this mean that all these parameters are optimized during training by the nlp.update() method?
From my limited NLP experience, I thought that embedding models were trained in advance, before training the downstream application (NER...). Hence my question, because this does not seem intuitive to me.

Besides, I would like to understand and change the hyperparameters. I read on this page what is possible: https://spacy.io/api/cli#train-hyperparams. However, there are some differences between the names on the web page and the names in nlp.get_pipe("ner").cfg. For instance, I did not find the following parameters on the web page: cnn_maxout_pieces, nr_feature_tokens, nr_class. Maybe I was not attentive enough. Where can I find the meaning of all the parameters of the NER model? Furthermore, which method can I use to tune them when training a new blank model?

Thank you in advance for your responses.
Best regards,

Hi @capitaine,

In spaCy we have a separate embeddings table for vectors that are trained beforehand. This table (if available) is used as a feature in the embedding calculation. But the hash embeddings are always updated, regardless of whether the static vectors are provided as well. So the short answer to your first question is "Yes, all of the parameters you listed are updated".
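To make the distinction concrete, here is a toy sketch in plain Python. It is not spaCy's implementation: the table sizes, the md5-based hashing, the `embed` and `update` helpers and the simple concatenation are all assumptions chosen for readability. It only illustrates the idea that the pretrained vectors act as a fixed input feature, while the rows of the hash-embedding table change on every update.

```python
import hashlib

WIDTH = 4      # embedding width (toy value, not spaCy's default)
N_ROWS = 16    # rows in the hash-embedding table (toy value)
N_HASHES = 3   # number of rows each key is mapped to (toy value)


def rows_for(key):
    """Map a string key to N_HASHES row indices using seeded hashes."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).hexdigest()
        rows.append(int(digest, 16) % N_ROWS)
    return rows


# Trainable hash-embedding table, initialised to zeros for readability.
hash_table = [[0.0] * WIDTH for _ in range(N_ROWS)]

# Pretrained ("static") vectors: an input feature that is never updated here.
static_vectors = {"paris": [0.1, 0.2, 0.3, 0.4]}


def embed(key):
    """Concatenate the static vector (if any) with the summed hash rows."""
    hashed = [0.0] * WIDTH
    for row in rows_for(key):
        for i in range(WIDTH):
            hashed[i] += hash_table[row][i]
    static = static_vectors.get(key, [0.0] * WIDTH)
    return static + hashed


def update(key, grad_hashed, lr=0.1):
    """Apply a gradient step to the hash rows only; static vectors stay fixed."""
    for row in rows_for(key):
        for i in range(WIDTH):
            hash_table[row][i] -= lr * grad_hashed[i]


before = embed("paris")
update("paris", grad_hashed=[1.0, 1.0, 1.0, 1.0])
after = embed("paris")

print("static part unchanged:", before[:WIDTH] == after[:WIDTH])
print("hashed part changed:  ", before[WIDTH:] != after[WIDTH:])
```

In the real pipeline the gradient comes from the NER loss and flows back through the CNN and the prediction layer as well, which is why all three groups of parameters move together during training.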

Regarding the hyper-parameters, it's true that spaCy doesn't really expose the hyper-parameters very well, and they're sort of confusing and under-documented. The reason is that the models simply aren't very sensitive to those hyper-parameters, so there's not very much to be gained from changing them. Other hyper-parameters are really internals. For instance, if you set an unexpected value for nr_feature_tokens, the model will segfault, and you won't even get a Python error! So it's not really intended to be changed.

I've done quite a lot of experimenting with the hyper-parameters, and the only ones really worth tuning on individual problems are the batch size and the dropout rate. Tuning the other hyper-parameters will generally result in less than 1% improvement in accuracy, so we haven't made that a prominent workflow.
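As a rough illustration of what tuning those two settings can look like, here is a plain-Python sketch. The `compounding` and `minibatch` helpers below are simplified stand-ins written for this example (spaCy ships utilities with the same names in spacy.util, but this code does not call them), and the candidate dropout rates and batch-size caps are hypothetical values, not recommendations.

```python
from itertools import product


def compounding(start, stop, factor):
    """Yield start, start*factor, ... capped at stop (growing batch sizes)."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= factor


def minibatch(items, sizes):
    """Split items into consecutive batches whose sizes come from `sizes`."""
    items = list(items)
    start = 0
    while start < len(items):
        size = max(1, int(next(sizes)))
        yield items[start:start + size]
        start += size


# Stand-in training set: 20 dummy examples.
train_data = [f"example {i}" for i in range(20)]

# Batch sizes grow 2 -> 4 -> 8 and then stay at the cap of 8.
batches = list(minibatch(train_data, compounding(2, 8, 2)))
batch_lengths = [len(batch) for batch in batches]
print(batch_lengths)

# Hypothetical candidate settings to compare on a held-out development set.
dropout_rates = [0.2, 0.35, 0.5]
max_batch_sizes = [8, 32]
grid = list(product(max_batch_sizes, dropout_rates))
for max_size, drop in grid:
    print(f"would train with max batch size {max_size} and dropout {drop}")
```

Each (batch size, dropout) pair would correspond to one full training run, with the dropout value passed to the update step and the scores compared on the development set afterwards.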
