replace sense2vec with a transformer model from Hugging Face

Hello, I watched the video where Ines explains how to do NER in Prodigy. My question is: how hard would it be, and what would the steps be, to replace sense2vec with a model from Hugging Face Transformers, or any model that can compute embeddings for words?
prodigy sense2vec.teach ........
Thanks in advance.

Hi! That's a nice idea and shouldn't be too difficult to implement.

You can find the source of the sense2vec.teach recipe here. However, I think the terms.teach code might be a better starting point to use as a template, because it doesn't contain all the sense2vec-specific logic for retrieving vectors keyed by word and tag (POS, entity label).

If you use a model via spacy-transformers, the above code may almost work out-of-the-box and you won't have to change much, since the .similarity() method uses the transformer model's embeddings if they're available.

But even if you do decide to re-implement it, the idea is pretty simple:

  1. Keep a "target vector" of the seed terms and the terms you've accepted.
  2. Loop over the vocabulary and compare each term's similarity to the target.
  3. If it's above a certain threshold, send it out. Otherwise, skip it.
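The steps above can be sketched in plain Python with a toy vector table (the words, vector values, threshold, and helper names are all made up for illustration; in a real recipe you'd pull vectors from your model's vocab and stream suggestions to Prodigy):

```python
import math

# Toy word vectors standing in for a real embedding table (values are made up).
VECTORS = {
    "pizza":  [0.9, 0.1, 0.0],
    "pasta":  [0.8, 0.2, 0.1],
    "burger": [0.7, 0.3, 0.0],
    "car":    [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    # Step 1: the "target vector" is the mean of the seed/accepted term vectors.
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def suggest(seeds, threshold=0.9):
    target = average([VECTORS[term] for term in seeds])
    # Step 2: loop over the vocabulary and compare each term to the target.
    for term, vec in VECTORS.items():
        if term in seeds:
            continue
        # Step 3: only send out terms above the similarity threshold.
        if cosine(vec, target) >= threshold:
            yield term

print(list(suggest(["pizza", "pasta"])))  # → ['burger']
```

In the real recipes, the target vector is also updated as you accept more terms, so the suggestions drift toward what you've confirmed.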

Awesome, thanks a lot! And by the way, thanks for the nice feature for relation labelling 🙂

I took a look at pattern.match, but the docs say it only works with tokens in the vocabulary, which means I can't use it, since I want to use this to help with an NER task (multi-word entities) before manual labelling. Am I wrong?

I'm not sure I understand the question, sorry! So did you already collect terms using your custom terms.teach recipe, or are you still working on that?


No, I can't use terms.teach, since it can only handle tokens in the vocabulary, but I have entities whose "synonyms" I want to find, and they are spans of multiple words, just as in your food NER example with sense2vec.

Well, you do need embeddings for those phrases, and you need to load those phrases from somewhere and iterate over them. In terms.teach, we're using the model's vocab. In sense2vec.teach, we're using the entries in the sense2vec vectors. So whichever embeddings you're using, you need to extract potentially similar candidates so you can check the similarity and decide what to suggest.
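To make that concrete for multi-word spans: one common trick is to build a phrase vector by averaging the token vectors, then rank candidate phrases by similarity to a target phrase. This is a toy sketch with made-up vectors and candidate phrases; a real setup would use your transformer's embeddings and some candidate extractor (noun chunks, n-grams, etc.):

```python
import math

# Toy token vectors (made-up values); a real setup would use model embeddings.
TOKEN_VECTORS = {
    "fried":   [0.6, 0.4],
    "chicken": [0.7, 0.3],
    "grilled": [0.5, 0.5],
    "cheese":  [0.8, 0.2],
    "parking": [0.1, 0.9],
    "lot":     [0.2, 0.8],
}

def phrase_vector(phrase):
    # Average the token vectors to get one vector for a multi-word span.
    vectors = [TOKEN_VECTORS[tok] for tok in phrase.split()]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank candidate phrases by similarity to the target phrase.
target = phrase_vector("fried chicken")
candidates = ["grilled cheese", "parking lot"]
ranked = sorted(candidates, key=lambda p: cosine(phrase_vector(p), target),
                reverse=True)
print(ranked)  # → ['grilled cheese', 'parking lot']
```

Averaging is a crude but serviceable baseline; contextual transformer embeddings of the whole span usually work better, at the cost of running the model over each candidate.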

Ah, I see. I just have to modify the code in terms.teach even when using spacy-transformers.