Hi! I answered your transfer learning question in the other thread. I hope this explains some of the terminology and what transfer learning usually refers to:
The idea is that you can often get better results by initialising your model with representations that already encode some knowledge about the language and the world, for example, language model weights that were trained by predicting the next word. One way to do this is to initialise the model with transformer embeddings and use those for the token-to-vector embedding. Your tokens end up with "better" and more representative vectors this way, and when those vectors are used as features in the component models (e.g. NER), you often see better results. What's "pretrained" here is the token-to-vector embedding layer, not the actual task-specific component like NER.
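To make the "pretrained tok2vec layer vs. task-specific head" split concrete, here's a minimal, library-agnostic sketch in plain PyTorch (not any particular library's actual API). All names, sizes and the checkpoint path are hypothetical; the point is just that the token-to-vector layer carries the pretrained weights, while the NER head is initialised from scratch and trained on the NER data:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # hypothetical vocabulary size
EMB_DIM = 128         # hypothetical embedding width
NUM_NER_LABELS = 5    # hypothetical number of NER tags


class Tok2Vec(nn.Module):
    """Token-to-vector layer: maps token IDs to context-sensitive vectors."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.encode = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)

    def forward(self, token_ids):
        vectors, _ = self.encode(self.embed(token_ids))
        return vectors


class NERModel(nn.Module):
    """Task-specific model: a fresh NER head on top of the (pretrained) tok2vec layer."""
    def __init__(self, tok2vec):
        super().__init__()
        self.tok2vec = tok2vec                                # pretrained part
        self.ner_head = nn.Linear(EMB_DIM, NUM_NER_LABELS)    # trained from scratch

    def forward(self, token_ids):
        return self.ner_head(self.tok2vec(token_ids))


# The pretraining phase (e.g. language modelling) would produce weights for Tok2Vec;
# here we just pretend such a checkpoint exists.
tok2vec = Tok2Vec()
# tok2vec.load_state_dict(torch.load("pretrained_tok2vec.pt"))  # hypothetical checkpoint

# Transfer: the NER model starts from the pretrained token-to-vector weights, and only
# the NER head (plus, optionally, fine-tuning of tok2vec) is learned from NER examples.
model = NERModel(tok2vec)
logits = model(torch.randint(0, VOCAB_SIZE, (1, 6)))  # one "sentence" of 6 token IDs
print(logits.shape)  # (1, 6, NUM_NER_LABELS): one score per token per NER label
```

In practice you'd swap the toy GRU encoder for transformer embeddings as described above, but the division of labour stays the same: the token-to-vector part is what's pretrained and transferred, and the NER component still has to be trained on labelled NER data.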