The model appears to perform adequately for NER, and I was hoping to extract the embeddings for other uses in semantic similarity. However, upon extracting the embeddings and calculating cosine similarities with a template like the one below, I'm finding it performs quite poorly when scored against a word-similarity dataset based on the same data we used to train the Prodigy NER model:
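A minimal sketch of the template (the model path is a placeholder; this assumes spacy-transformers v1.x, where `doc._.trf_data.tensors[1]` holds the pooled transformer output per span):

```python
import numpy as np
import spacy

# Placeholder path to the Prodigy-trained NER pipeline
nlp = spacy.load("./prodigy-ner-model/model-best")

def embed(text: str) -> np.ndarray:
    doc = nlp(text)
    # Pooled transformer output; [0] takes the first span's vector
    return doc._.trf_data.tensors[1][0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example word pair (placeholder)
print(cosine(embed("car"), embed("automobile")))
```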
The above template code was run through our word-pair dataset, which contains highly similar and dissimilar word pairs that we've prepared. This dataset worked quite well when we scored it with cosine similarities from separate fastText word embeddings trained on the same data we used for the Prodigy NER model.
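A sketch of that fastText baseline (the vector path and word pair are placeholders):

```python
import fasttext
import numpy as np

# Placeholder path to the separately trained fastText vectors
ft = fasttext.load_model("./fasttext-vectors.bin")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score a word pair the same way as with the transformer embeddings
print(cosine(ft.get_word_vector("car"), ft.get_word_vector("automobile")))
```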
Did I miss a step or do something wrong here, or does this mean that a model trained in Prodigy for NER can't have its embeddings reused for semantic similarity work?
Although it doesn't matter for this short example, note that this is the pooled output from the transformer for the first span only (= the first 128 spaCy tokens) of the text, not for the whole text:
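Concretely, with the same spacy-transformers v1.x layout as in the sketch above (the model and text are placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Some long text. " * 50)

# tensors[1] stacks one pooled vector per span (roughly 128 tokens each),
# so index 0 covers only the first span, not the whole document.
pooled_per_span = doc._.trf_data.tensors[1]   # shape: (n_spans, hidden_size)
first_span_pooled = pooled_per_span[0]
print(pooled_per_span.shape, first_span_pooled.shape)
```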
I'm not sure how well it works for individual words, but you can get better sentence or text similarity results with approaches like SentenceTransformers, and there's a third-party library that incorporates this into spaCy: spaCy - sentence-transformers · spaCy Universe. SentenceTransformers (SBERT) was developed in part because researchers noticed that averaging transformer embeddings performs worse than averaging static word2vec/GloVe vectors for sentence similarity.
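For example, a minimal sketch using SentenceTransformers directly (the model name is just one commonly used choice, not a specific recommendation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode both texts and compare with cosine similarity
embeddings = model.encode(["car", "automobile"])
print(util.cos_sim(embeddings[0], embeddings[1]))
```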
If you just need vector similarity for individual words, you can add the vectors from en_core_web_md or en_core_web_lg to a trf model like this:
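A minimal sketch of one way to do that (the output path is a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
nlp_lg = spacy.load("en_core_web_lg")

# Copy the static vector table into the trf pipeline's vocab and save
# the combined pipeline to disk
nlp.vocab.vectors = nlp_lg.vocab.vectors
nlp.to_disk("./en_core_web_trf_with_vectors")
```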
Thank you! Regarding the second approach there: after saving the model to disk and then passing it as the --base-model in prodigy train, would the vectors get updated with the training data?
Initially I wasn't sure if SentenceTransformers was needed, since my word-similarity usage is almost exclusively single words, but I can give that a try as well.
The static vectors don't get updated during training. If your pipeline just includes components that listen to a transformer, then the vectors won't be used at all during training. I think if you're comparing single words, static word vectors sound like the right choice.
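For single words, a quick sanity check with static vectors could look like this (the word pair is a placeholder):

```python
import spacy

# en_core_web_md ships a static vector table, so Token.similarity
# uses it directly
nlp = spacy.load("en_core_web_md")
print(nlp("car")[0].similarity(nlp("automobile")[0]))
```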
If you add a large vector table to the trf model, it will make the overall pipeline package fairly large (1-2 GB), but if that doesn't cause problems for you, there's no technical reason not to have both.
(Note that if you're fine-tuning transformer+ner from en_core_web_trf with Prodigy, tagger and parser performance will degrade afterwards if you just add those components back and they listen to the now-modified transformer.)