Attributes of trained model

We are using the Span Annotator to train a model that recognizes medical concepts contained in unstructured medical documentation.

Given that the model vectorizes each named entity based on a multi-token context window (4 tokens on each side of the entity, the default setting), and assuming that we are using a very large training corpus, do the resulting nearest-neighbor vectors in the trained model possess some form of relatedness?

For example, would the vectors for the entities "heart attack" and "myocardial infarction" (these are synonyms) likely be found in proximity to each other using cosine similarity?

Thanks very much


Hi @cwix, welcome to Prodigy!

If you're using spaCy's entity recognizer, then yes, the vectors should possess some form of relatedness. It's worth looking into spaCy's model architectures and how they affect the word vectors and their similarity.
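
To make "proximity under cosine similarity" concrete, here is a minimal sketch in plain Python. The 4-dimensional vectors are made up for illustration and do not come from any trained model, and the `nearest_neighbors` helper is a hypothetical name, not part of spaCy or Prodigy. It only shows how a nearest-neighbor lookup would rank entities once you have vectors for them.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "heart attack": [0.9, 0.1, 0.3, 0.0],
    "myocardial infarction": [0.8, 0.2, 0.4, 0.1],
    "broken ankle": [0.0, 0.9, 0.1, 0.7],
}


def nearest_neighbors(query, vectors):
    """Rank every other entry by cosine similarity to `query`, best first."""
    scored = [
        (name, cosine_similarity(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in nearest_neighbors("heart attack", toy_vectors):
        print(f"{name}: {score:.3f}")
```

With real vectors you would swap `toy_vectors` for the ones your pipeline produces. In spaCy, `Doc.similarity` and `Span.similarity` compute cosine similarity over the underlying vectors by default, so the same comparison can be made directly on two entity spans, provided the pipeline actually has vectors available.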