word embeddings from trained NER model?

I've trained an NER model using the below command:

python -m prodigy train ./mymodel --ner my_data --base-model en_core_web_trf

The model appears to perform adequately for NER, and I was hoping to extract its embeddings for use in semantic similarity. However, after extracting the embeddings and calculating cosine similarities with a template like the one below, I'm finding that it performs quite poorly when scored against a word similarity dataset based on the same data that we used to train the Prodigy NER model:

import spacy
from numpy import dot
from numpy.linalg import norm

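# exclude expects a list of component names; a single comma-joined string is treated as one (non-existent) name
spacy_nlp = spacy.load("./mymodel/model-best/", exclude=["tagger", "parser"])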
nlp_orig = spacy.load("en_core_web_trf")
spacy_nlp.add_pipe("parser", source=nlp_orig, after="transformer")
spacy_nlp.add_pipe("tagger", source=nlp_orig, after="parser")

text1 = "glove"
text2 = "boots"
text1_embeddings = spacy_nlp(text1)._.trf_data.tensors[1][0]
text2_embeddings = spacy_nlp(text2)._.trf_data.tensors[1][0]

similarity = dot(text1_embeddings, text2_embeddings) / (norm(text1_embeddings) * norm(text2_embeddings))

The above template code was run over our similar-word pair dataset, which contains highly similar and dissimilar word pairs that we've prepared. This dataset worked quite well when we scored it with cosine similarities from separate fastText word embeddings trained on the same data that we used to train our Prodigy NER model.
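
For reference, the scoring step described above can be sketched roughly as follows. This is only an illustration: the helper names (cosine_similarity, score_pairs), the toy_vectors table and the word pairs are hypothetical stand-ins, not our real data or pipeline. Any source of vectors (transformer output, fastText, static spaCy vectors) could be plugged in as the lookup table.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

def score_pairs(vectors, pairs):
    """Score each (word1, word2) pair; pairs with a missing word get None."""
    results = []
    for w1, w2 in pairs:
        if w1 in vectors and w2 in vectors:
            results.append((w1, w2, cosine_similarity(vectors[w1], vectors[w2])))
        else:
            results.append((w1, w2, None))
    return results

# Hypothetical toy vectors, chosen only to make the arithmetic easy to follow.
toy_vectors = {
    "glove": np.array([2.0, 0.0]),
    "mitten": np.array([1.0, 0.0]),
    "boots": np.array([0.0, 3.0]),
}
toy_pairs = [("glove", "mitten"), ("glove", "boots"), ("glove", "unknownword")]

scores = score_pairs(toy_vectors, toy_pairs)
for w1, w2, score in scores:
    print(w1, w2, score)
```

Returning None for out-of-vocabulary words keeps missing vectors from being silently counted as "dissimilar" in the evaluation.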

Did I miss a step or do something wrong here, or does this mean that a model trained in prodigy for NER cannot be used for its embeddings in semantic similarity work?

Thanks a lot.

Although it doesn't matter for this short example, note that spacy_nlp(text1)._.trf_data.tensors[1][0] is the pooled output from the transformer for the first span (= the first 128 spaCy tokens) in the text, not for the whole text.
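
To make the difference concrete, here is a minimal sketch that uses a made-up array in place of the real pooled output. It assumes the pooled output holds one row per span, which is what the [0] index above implies; the values and the three-span shape are hypothetical. Averaging over spans is just one simple way to get a single vector for a longer text, not something the library does for you.

```python
import numpy as np

# Hypothetical stand-in for the pooled transformer output:
# one row per span, one column per hidden dimension (3 spans, width 3 here).
pooled = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# What indexing with [0] gives you: only the first span.
first_span_vector = pooled[0]

# One simple alternative (an assumption, not the library's method):
# average the pooled vectors of all spans to represent the whole text.
whole_text_vector = pooled.mean(axis=0)

print(first_span_vector)
print(whole_text_vector)
```

For a single word there is only one span, so both vectors would be identical.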

I'm not sure how well the pooled transformer output works for individual words, but you can get better sentence or text similarity results with approaches like SentenceTransformers, and there's a third-party library that incorporates this into spaCy: https://spacy.io/universe/project/spacy-sentence-bert. SentenceTransformers (SBERT) was developed in part because researchers had noticed that averaging transformer embeddings performs worse than averaging static word2vec/GloVe vectors for sentence similarity.

If you just need vector similarity for individual words, you can add the vectors from en_core_web_md or en_core_web_lg to a trf model like this:

vectors_nlp = spacy.load("en_core_web_md")
nlp = spacy.load("en_core_web_trf")
nlp.vocab.vectors = vectors_nlp.vocab.vectors
nlp.to_disk("/path/to/en_core_web_trf_with_vectors")

And then use this path as the base model instead of en_core_web_trf in prodigy.
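
Once the pipeline has static vectors, single-word similarity can be read straight from the vocab. Here's a rough sketch: to keep it runnable without any downloads, it uses a blank English pipeline with three made-up vectors instead of the saved en_core_web_trf_with_vectors pipeline, and word_similarity is a hypothetical helper name. With the real pipeline you would load it from the saved path and skip the set_vector step.

```python
import numpy as np
import spacy

# Stand-in for the saved pipeline with vectors: a blank pipeline
# plus a few made-up vectors (hypothetical values, for illustration only).
nlp = spacy.blank("en")
toy_vectors = {
    "glove": [1.0, 0.2, 0.0],
    "boots": [0.9, 0.3, 0.0],
    "spreadsheet": [0.0, 0.1, 1.0],
}
for word, values in toy_vectors.items():
    nlp.vocab.set_vector(word, np.asarray(values, dtype="float32"))

def word_similarity(nlp, word1, word2):
    """Cosine similarity of two words' static vectors, or None if either is missing."""
    lex1, lex2 = nlp.vocab[word1], nlp.vocab[word2]
    if not (lex1.has_vector and lex2.has_vector):
        return None
    return float(lex1.similarity(lex2))

close = word_similarity(nlp, "glove", "boots")
far = word_similarity(nlp, "glove", "spreadsheet")
missing = word_similarity(nlp, "glove", "qwertyzxcv")
print(close, far, missing)
```

Going through nlp.vocab rather than nlp(text) means the transformer isn't run at all for a plain vector lookup.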

Thank you. Regarding the second approach: after saving the model to disk and then passing it as the --base-model in prodigy train, would the vectors get updated with the training data?

Initially I wasn't sure whether SentenceTransformers was needed, since my word similarity usage is almost exclusively single words, but I can give that a try as well.

The static vectors don't get updated during training. If your pipeline just includes components that listen to a transformer, then the vectors won't be used at all during training. I think if you're comparing single words, static word vectors sound like the right choice.

If you add a large vector table to the trf model, it will make the overall pipeline package fairly large (1-2 GB), but if that doesn't cause problems for you, there's no technical reason not to have both.

(Note that if you're fine-tuning transformer+ner from en_core_web_trf with prodigy, you need to be aware that the tagger and parser performance will be degraded afterwards if you just add them back and they listen to the now-modified transformer.)