Can you explain how exactly HashEmbed works?

I'm trying to understand the following snippet, and the rationale behind it…

    import spacy
    from spacy.attrs import LOWER, SHAPE, PREFIX, SUFFIX
    # HashEmbed is Thinc's hashed embedding layer

    nlp = spacy.blank('en')
    nlp.vocab.lex_attr_getters[PREFIX] = lambda string: string[:3]
    nlp.vocab.lex_attr_getters[SUFFIX] = lambda string: string[-3:]

    doc = nlp('brown fox jumps over lazy dog')
    feat = doc.to_array([LOWER, SHAPE, PREFIX, SUFFIX])


    width = 32
    # HashEmbed(output width, number of rows in the table, column=index of the
    # feature in the (LOWER, SHAPE, PREFIX, SUFFIX) array built above)
    lower_case = HashEmbed(width, 100, column=0)
    shape = HashEmbed(width//2, 200, column=1)
    prefix = HashEmbed(width//2, 100, column=2)
    suffix = HashEmbed(width//2, 100, column=3)

Here’s a rough draft of a write-up of this:

Chapter 4: Compact word vectors with Bloom Embeddings

A high-coverage word embedding table will usually be quite large.
One million 32-bit floats occupies 4MB of memory, so one million
300-dimensional vectors will be 1.2GB in size. The large model
size is at least annoying for many applications, while for others
it’s completely prohibitive.
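
To make that arithmetic concrete, here is the back-of-the-envelope calculation (plain Python, nothing spaCy-specific):

n_words = 1_000_000
n_dims = 300
bytes_per_float = 4  # 32-bit floats

table_bytes = n_words * n_dims * bytes_per_float
print(table_bytes / 1e9)  # 1.2 (GB)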

There are three obvious approaches to reducing the size of the embedding table:

  1. Reduce the number of words in the vocabulary.
  2. Reduce the number of dimensions per vector.
  3. Reduce the number of bits per dimension.

While all three of these options can be effective, there’s also a less
obvious solution:

  4. Cheat, using a probabilistic data structure

Probabilistic data structures are a natural fit for machine learning
models, so they’re quite widely used. However, they’re definitely
unintuitive — which is why we refer to this solution as a “cheat”.
We’ll start by introducing the full algorithm, without dwelling too
long on why it works. We’ll then go back and fill in more of the
intuition, before finally connecting it to related ideas such as Bloom
filters and the “hashing trick” that’s common in web-scale machine
learning tasks such as advertising placement, information retrieval
and content recommendation.

4.1 The Bloom Embeddings algorithm

In a normal embedding table, each word-string is mapped to a distinct
ID. Usually these IDs will be sequential — so if you have a vocabulary
of 100 words, your words will be mapped to the numbers in range(100). The sequential
IDs can then be used as indices into an embedding table: if you have 100
words in your vocabulary, you have 100 rows in the table, and each word receives
its own vector.
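
For illustration, a conventional lookup might look like this (a toy sketch, not spaCy's actual implementation):

import numpy

vocab = ['the', 'quick', 'brown', 'fox']
word2id = {word: i for i, word in enumerate(vocab)}  # sequential IDs
table = numpy.random.uniform(-0.1, 0.1, (len(vocab), 32))  # one row per word

vector = table[word2id['fox']]  # every word gets its own row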

However, there's no limit to the number of unique words that might occur in a
sample of text — but we definitely want a limited number of rows in our
embedding table. Some of the rows in our table will therefore need to be shared
between multiple words in our vocabulary. One obvious solution is to set aside
a single vector in the table for all the out-of-vocabulary words. Words 0-98 will
each receive their own vector, while all other words are assigned to vector 99:
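
A minimal sketch of that lookup (for illustration; it mirrors the get_row function below):

def get_row(word_id, number_vector=100, oov_row=99):
    # words 0-98 keep their own row; everything else shares row 99
    if word_id < oov_row:
        return word_id
    else:
        return oov_row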

However, this asks vector 99 to do a lot of work. What if we gave more vectors
to the unknown words?

Instead of only one unknown word vector, we could have 10: words 0-89 would be
assigned their own vector as before. However, all other words would be randomly
mapped to the remaining 10 vectors:


def get_row(word_id, number_vector=100, number_oov=10):
    number_known = number_vector - number_oov
    if word_id < number_known:
        return word_id
    else:
        # out-of-vocabulary words share the last number_oov rows
        return number_known + (word_id % number_oov)

This gives the model a little more resolution for the unknown words. If all
out-of-vocabulary words are assigned the same vector, then they’ll all look
identical to the model. Even if the training data actually includes information
that shows two different out-of-vocabulary words have important, different implications
– for instance, if one word is a strong indicator of positive sentiment, while the
other is a strong indicator of negative sentiment — the model won’t be able
to tell them apart. However, if we have 10 buckets for the unknown words, we might
get lucky, and assign these words to different buckets. If so, the model would be
able to learn that one of the unknown-word vectors makes positive sentiment more likely,
while the other vector makes negative sentiment more likely.
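
With the 10-bucket get_row function above, that's easy to see: two IDs outside the first 90 both fall in the out-of-vocabulary region, but usually land in different buckets (the IDs here are arbitrary examples):

print(get_row(250))   # 90 + (250 % 10) = 90
print(get_row(301))   # 90 + (301 % 10) = 91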

If this is good, then why not do more of it? Bloom Embeddings are like an extreme
version, where every word is handled like the unknown words above — there are 100
vectors for the “unknown” portion, and 0 for the “known” portion.

So far, this approach seems weird, but not necessarily good. The part that makes it
unfairly effective is the next step: by simply doing the same thing multiple times,
we can greatly improve the resolution, and assign unique representations to far more
words than we have vectors. The code in full:

import mmh3
import numpy

def allocate(n_vectors, n_dimensions):
    table = numpy.zeros((n_vectors, n_dimensions), dtype='f')
    table += numpy.random.uniform(-0.1, 0.1, table.size).reshape(table.shape)
    return table
    
def get_vector(table, word):
    hash1 = mmh3.hash(word, seed=0)
    hash2 = mmh3.hash(word, seed=1)
    row1 = hash1 % table.shape[0]
    row2 = hash2 % table.shape[0]
    # the word's vector is the sum of the two rows it hashes to
    return table[row1] + table[row2]

def update_vector(table, word, d_vector):
    hash1 = mmh3.hash(word, seed=0)
    hash2 = mmh3.hash(word, seed=1)
    row1 = hash1 % table.shape[0]
    row2 = hash2 % table.shape[0]
    # gradient descent step, applied to both rows (learning rate 0.001)
    table[row1] -= 0.001 * d_vector
    table[row2] -= 0.001 * d_vector

In this example, we’ve used two keys, assigned from two random hash functions. It’s
unlikely that two words will collide on both keys, so by simply summing the vectors
together, we’ll assign most words a unique representation.
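
If you want to check that intuition empirically, you can count how many words fail to get a unique pair of keys for a given table size (a small sketch; the vocabulary here is just synthetic strings, and the counts will depend on the sizes you pick):

from collections import Counter
import mmh3

def count_collisions(words, n_rows):
    # a word's identity is the *pair* of rows it hashes to
    keys = [(mmh3.hash(w, 0) % n_rows, mmh3.hash(w, 1) % n_rows) for w in words]
    counts = Counter(keys)
    # number of words that don't get a unique (row1, row2) pair
    return sum(count - 1 for count in counts.values() if count > 1)

words = [str(i) for i in range(20000)]
for n_rows in (1000, 10000, 100000):
    print(n_rows, count_collisions(words, n_rows))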

4.2 Example

For the sake of illustration, let’s step through a very small example, explicitly.

Let’s say we have this vocabulary of 20 words:

vocab = ['apple', 'strawberry', 'orange', 'juice', 'drink', 'smoothie',
         'eat', 'fruit', 'health', 'wellness', 'steak', 'fries', 'ketchup',
         'burger', 'chips', 'lobster', 'caviar', 'service', 'waiter', 'chef']

We’ll embed these into two dimensions. Normally this would give us a table of
(20, 2) floats, which we would randomly initialise. With the hashing trick,
we can make the table smaller. Let’s give it 15 vectors:

normal = numpy.random.uniform(-0.1, 0.1, (20, 2)).astype('f')

hashed = numpy.random.uniform(-0.1, 0.1, (15, 2)).astype('f')

In the normal table, we want to map each word in our vocabulary to its own vector:

word2id = {}
def get_normal_vector(word, table):
    if word not in word2id:
        word2id[word] = len(word2id)
    return table[word2id[word]]

The hashed table only has 15 rows, so some words will have to share. We'll handle
this by mapping each word to an arbitrary integer – its "hash value" – which we'll
then mod into the range (0, 15). Importantly, we need to be able to compute
multiple, distinct hash values for each key – so Python's built-in hash function
is inconvenient. We'll therefore use Murmurhash.

Let’s see what keys we get for our 20 vocabulary items, using Murmurhash:

>>> hashes1 = [mmh3.hash(w, 1) % 15 for w in vocab]
>>> hashes1
[3, 6, 4, 13, 8, 3, 13, 1, 9, 12, 11, 4, 2, 13, 5, 10, 0, 2, 10, 13]

As you can see, some keys are shared between multiple words, while 2/15 keys are
unoccupied. This is obviously not ideal! If multiple words have the same key,
they'll map to the same vector – as far as the model is concerned, "apple"
and "smoothie" will be indistinguishable. It won't be clear which word was used –
they have the same representation.

To address this, we simply hash the words again, this time using a different seed
– so that we get a different set of arbitrary keys:

>>> from collections import Counter
>>> hashes2 = [mmh3.hash(w, 2) % 15 for w in vocab]
>>> len(Counter(hashes2).most_common())
12

This one’s even worse – 3 keys unoccupied! But our strategy is not to
keep drawing until we get a favourable seed. Instead, consider this:

>>> len(Counter(zip(hashes1, hashes2)))
20

By combining the results from the two hashes, our 20 words distribute perfectly,
into 20 unique combinations. This makes sense: we expect to have some words
overlapping on one of the keys, but we’d have to be very unlucky for a pair of
words to overlap on both keys.

This means that if we simply add the two vectors together, each word once more
has a unique representation:

>>> for word in vocab:
...   key1 = mmh3.hash(word, 1) % 15
...   key2 = mmh3.hash(word, 2) % 15
...   vector = hashed[key1] + hashed[key2]
...   print(word, '%.3f %.3f' % tuple(vector))
...
apple 0.161 0.163
strawberry 0.128 -0.024
orange 0.157 -0.047
juice -0.017 -0.023
drink 0.097 -0.124
smoothie 0.085 0.024
eat 0.000 -0.105
fruit -0.060 -0.053
health 0.166 0.103
wellness 0.011 0.065
steak 0.155 -0.039
fries 0.157 -0.106
ketchup 0.076 0.127
burger 0.045 -0.084
chips 0.082 -0.037
lobster 0.138 0.067
caviar 0.066 0.098
service -0.017 -0.023
waiter 0.039 0.001
chef -0.016 0.030

We now have a function that maps our 20 words to 20 unique vectors – but we’re
storing weights for only 15 vectors in memory. Now the question is: will we be able to find values for these
weights that let us actually map words to useful vectors?

Let’s do a quick experiment to see how this works. We’ll assign “true” values for our little
vocabulary, and see how well we can approximate them with our compressed table. To get the “true”
values, we could put the “science” in data science, and drag the words around into
reasonable-looking clusters. But for our purposes, the actual “true” values don’t matter.
We’ll therefore just do a simulation: we’ll assign random vectors as the “true” state, and see
if we can learn values for the hash embeddings that match them.

The learning procedure will be simple stochastic gradient descent:


    import numpy
    import numpy.random as random
    import mmh3

    random.seed(0)

    nb_epoch = 10
    learn_rate = 0.001
    nr_hash_vector = 15

    words = [str(i) for i in range(20)]
    true_vectors = numpy.random.uniform(-0.1, 0.1, (len(words), 2))
    hash_vectors = numpy.random.uniform(-0.1, 0.1, (nr_hash_vector, 2))
    examples = list(zip(words, true_vectors))

    for epoch in range(nb_epoch):
        random.shuffle(examples)
        loss = 0.0
        for word, truth in examples:
            key1 = mmh3.hash(word, 0) % nr_hash_vector
            key2 = mmh3.hash(word, 1) % nr_hash_vector

            hash_vector = hash_vectors[key1] + hash_vectors[key2]

            diff = hash_vector - truth

            hash_vectors[key1] -= learn_rate * diff
            hash_vectors[key2] -= learn_rate * diff
            loss += (diff**2).sum()
        print(epoch, loss)

It’s worth taking some time to play with this simulation. You can start by doing some sanity checks (a small harness for these experiments is sketched after this list):

  • How does the loss change with nr_hash_vector?
  • If you remove key2, does the loss go up?
  • What happens if you add more hash keys?
  • What happens as the vocabulary size increases?
  • What happens when more dimensions are added?
  • How sensitive are the hash embeddings to the initial conditions? If we change the random seed,
    do we ever get unlucky?
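
For example, a small harness along these lines lets you sweep those settings (a sketch: run_simulation and its parameter names are just for this example, not part of spaCy or Thinc):

    import numpy
    import numpy.random as random
    import mmh3

    def run_simulation(nr_hash_vector=15, nr_words=20, width=2, nr_keys=2,
                       nb_epoch=10, learn_rate=0.001, seed=0):
        random.seed(seed)
        words = [str(i) for i in range(nr_words)]
        true_vectors = numpy.random.uniform(-0.1, 0.1, (nr_words, width))
        hash_vectors = numpy.random.uniform(-0.1, 0.1, (nr_hash_vector, width))
        examples = list(zip(words, true_vectors))
        for epoch in range(nb_epoch):
            random.shuffle(examples)
            loss = 0.0
            for word, truth in examples:
                keys = [mmh3.hash(word, k) % nr_hash_vector for k in range(nr_keys)]
                hash_vector = sum(hash_vectors[key] for key in keys)
                diff = hash_vector - truth
                for key in keys:
                    hash_vectors[key] -= learn_rate * diff
                loss += (diff ** 2).sum()
        return loss  # loss over the final epoch

    for nr_hash_vector in (5, 15, 50):
        print(nr_hash_vector, run_simulation(nr_hash_vector=nr_hash_vector))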

If you play with the simulation for a while, you’ll start to get a good feel for the dynamics —
and hopefully you’ll have a clear idea of why the technique works. Armed with that insight,
let’s move forward to a practical recipe for computing word representations, which can be used
in any model for deep learning with text.


Thank you so so much, I am going to read this and comment back!

BTW - are you writing a book for spaCy + Thinc? (that would be awesome!)

@honnibal so it seems like HashEmbed is a way to map the inputs to embeddings (with the double-hash trick), the values of which will be learnt as part of the training process?

Thanks @honnibal. It is really a good trick for learning word representations. I will try to use this in some of my use cases too.

Hello @honnibal, thanks for the clarification. Is this part of a book? I would really enjoy the book.

Also, are Bloom embeddings used in all the models, like the French model?

isn't this (n_vectors, n_dimensions)

isn't this (0,15) ?

Thanks @honnibal for the great explanation.

I have a couple of related questions: Is the MultiHashEmbed in spaCy pre-trained in pipelines like en_core_web_sm? I mean, are the weights stored in en_core_web_sm? Can we just use it as is, or is any training necessary?

If it was pre-trained, how was it trained? Using an end-to-end language pipeline on some downstream tasks? Or using other trained word embeddings (vectors) that MultiHashEmbed was trying to approximate?

As I am a newbie to spaCy, to better understand how pipelines that don't contain static word vectors work, I would appreciate a reference to the training code on GitHub (Thinc or spaCy repo).

Thanks 🙂

Since this is the Prodigy forum, it might be best to ask spaCy-specific questions on the spaCy discussion board. That said, you might enjoy this blog post, which doesn't answer all your questions, but does explain some more details.