I’m interested in creating a new special token that spacy can use during NER. In particular, I’d like to train spacy to recognize the occupation (entity) of a particular person in the text. For example, given the sentence “Joe the teacher and John the engineer walked down the street”, if I cared about Joe, I’d want the model to mark “teacher” as the occupation entity.
To do so, I was thinking of marking the person of interest with a special token, e.g. transforming the sentence to “KEY_PERSON the teacher and John the engineer walked down the street.” The model could then learn to find the occupation of just this particular special token.
My question is, what’s the best way to do this or the best token to use? I’m guessing that using KEY_PERSON will trigger the prefix / suffix / shape word embeddings but not the original word’s embedding since it won’t be in the vocab. Is there a better token to use, or a way to allow spacy to learn the embedding for this special token?
(If it matters, I’m planning to start the NER task with a blank language model).
The textcat model extracts four or five lexical attributes per token, depending on whether you’re using pre-trained vectors. Tokens fetch their lexical attributes via a pointer to a lexical type. All tokens of the same type point to the same underlying lexical data – so you can set the lexical attributes there.
Does this interpretation of tokens also apply to the NER model? (You mentioned the textcat model in your reply.)
If I set the norm of a token to “hello”, will the norm embedding the model uses for that token also be that of “hello”?
When you mention it could just learn the new word, would spacy retrain the norm embeddings, either for a new norm or for an existing one? I’m assuming that a new norm has an all-zero embedding, so it would be indistinguishable from any other out-of-vocab norm (though I can’t easily check for a blank model).
If I understand the question, yes. The flow goes like doc.to_array([NORM, PREFIX, SUFFIX, SHAPE]). Then we have a set of embedding tables that each grab a column from the table. If you set two terms to have identical norm, they’ll have the same values in the feature array, and then get assigned the same norm vector. Of course if the other features are different (prefix, suffix, shape) the words will get somewhat different vectors in the end.
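To make that concrete, here is a toy sketch in plain Python of what the feature rows look like when two words share a norm. It does not call spaCy. The helper names (word_shape, features) and the overrides dict are made up for illustration, and the prefix, suffix and shape rules are simplified assumptions rather than spaCy's exact definitions.

```python
def word_shape(text):
    # Simplified shape: letters become x/X, digits become d, anything else is kept.
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


def features(text, norm_overrides=None):
    # One row of the feature table: (norm, prefix, suffix, shape).
    # Assumption: the norm defaults to the lowercased text unless overridden.
    norm_overrides = norm_overrides or {}
    norm = norm_overrides.get(text, text.lower())
    return (norm, text[:1], text[-3:], word_shape(text))


# Give the special token the same norm as an ordinary word.
overrides = {"KEY_PERSON": "hello"}
rows = [features(word, overrides) for word in ["KEY_PERSON", "hello"]]

for row in rows:
    print(row)
```

Both rows carry the same value in the norm column, so they would look up the same norm vector, while the prefix, suffix and shape columns differ and pull the final vectors apart.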
This bit is rather tricky: we don’t have OOV terms, because we’re using “hash embeddings”. Let’s say we have an embedding table of 7000 rows for the norms. Normally you would give 6999 rows to the vocabulary, and 1 row would be shared by all OOV words. We don’t do this.
Instead we rehash the feature 4 times, and mod each into the table. Then we sum the pieces.
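A rough sketch of that lookup in plain Python, using the 7000-row table from the example above. This is an illustration under assumptions, not the library's code: SHA-256 with a seed prefix stands in for whatever hash function is actually used, the vector width of 8 is arbitrary, and the names seeded_hash, rows_for and embed are invented here.

```python
import hashlib
import random

N_ROWS = 7000   # table size from the example in the text
WIDTH = 8       # hypothetical vector width, chosen only for the sketch
N_HASHES = 4    # the feature is rehashed four times

# Randomly initialised embedding table; training would update these rows.
rng = random.Random(0)
table = [[rng.uniform(-0.1, 0.1) for _ in range(WIDTH)] for _ in range(N_ROWS)]


def seeded_hash(feature, seed):
    # Stand-in for a seeded hash function (assumption: any well-mixed hash works here).
    data = f"{seed}:{feature}".encode("utf8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")


def rows_for(feature):
    # Hash the feature once per seed and mod each hash into the table.
    return [seeded_hash(feature, seed) % N_ROWS for seed in range(N_HASHES)]


def embed(feature):
    # Sum the selected rows to get the vector for this feature.
    vector = [0.0] * WIDTH
    for row in rows_for(feature):
        for i, value in enumerate(table[row]):
            vector[i] += value
    return vector


print(rows_for("KEY_PERSON"))
print(embed("KEY_PERSON"))
```

No row is reserved for unknown words here: any string, seen before or not, maps to four rows, and two different strings only get the same vector if all four of their rows coincide.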
This means that we’ll always assign a non-zero vector for a word the first time we see it, and in fact the vector we assign is likely to be distinct from any other unseen word. As you make updates, the model learns to refine the representation for that word, just as it would refine the representation of an “in vocabulary” word in a normal embedding table.
So we don’t have OOV words: we can learn an unbounded number of vocabulary items, by letting words share parts of their representations. In theory it should be hard to train this; in practice it’s really not — think of all the other ways we can break our models and still have them learn things. To paraphrase Jurassic Park: the model uh, finds a way.