TensorBoard and spaCy/Gensim models

Can I visualize the trained word2vec model in TensorBoard to understand the relationships between words?

This is cool! By default, all models that Prodigy exports are spaCy model packages, so it won’t work out-of-the-box. However, I just had a look at the data the Embedding Projector expects, and we might be able to offer a converter or something like that for spaCy :blush:

@ines Any chance this will come in the next release? Or if you can point me in some direction, I can work on this problem.

I haven’t really had much time yet to look at this in detail (and it’s not the highest on our list of priorities, since we want to focus on the core functionalities first). But if you want to give this a go, this would be super cool! :+1: I’m sure the community would love this, too.

The embeddings projector shows the following examples for the input data:

Vectors (3 vectors of 4 dimensions):

0.1\t0.2\t0.5\t0.9
0.2\t0.1\t5.0\t0.2
0.4\t0.1\t7.0\t0.8 

Metadata:

Pokémon\tSpecies
Wartortle\tTurtle
Venusaur\tSeed
Charmeleon\tFlame 

So if you have a spaCy model with vectors, you could iterate over nlp.vocab.vectors.data (a numpy array) and generate the tab-separated vectors file from it. Alternatively, you could iterate over the vocab and, for each entry, add its text to the metadata file and its vector to the vectors file.
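
A minimal sketch of the second approach, assuming the words and vectors are already available as (word, vector) pairs. The function name `write_projector_tsv` and the file names are made up for illustration. It writes one tab-separated row of floats per word, plus a single-column metadata file, which takes no header row.

```python
def write_projector_tsv(pairs, vectors_path, metadata_path):
    """Write (word, vector) pairs as two row-aligned TSV files.

    `pairs` is any iterable of (str, sequence of numbers).
    Returns the number of rows written.
    """
    count = 0
    with open(vectors_path, "w", encoding="utf-8") as vec_file, \
         open(metadata_path, "w", encoding="utf-8") as meta_file:
        for word, vector in pairs:
            # One tab-separated row of floats per word.
            vec_file.write("\t".join(str(float(x)) for x in vector) + "\n")
            # Single-column metadata: one label per line, no header row.
            meta_file.write(word + "\n")
            count += 1
    return count


# With spaCy, the pairs might be built from the vocab like this
# (an assumption, not tested here):
#   pairs = ((lex.text, lex.vector) for lex in nlp.vocab if lex.has_vector)
#   write_projector_tsv(pairs, "vectors.tsv", "metadata.tsv")
```

Line i of the metadata file labels row i of the vectors file, so both files have to be written in the same loop. Blank words would still need a stand-in label before writing.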

If you end up trying this, definitely keep us updated!

Sure, will give it a try over the weekend. Thanks.

Is there a way to convert a spaCy model to a gensim model? I want to load the model into TensorBoard for visualization. Currently I can do that with a gensim model, but I cannot upload a spaCy model to generate the visualization.

Looks like tensorboard support might be in the works:

If you're looking for a shorter-term solution I bet you could figure out how to do it by modifying the code in this gist: Convert gensim word2vec to tensorboard visualized model, detail: https://eliyar.biz/using-pre-trained-gensim-word2vector-in-a-keras-model-and-visualizing/ · GitHub

I am already on it, but now I need to convert the spaCy model to gensim. Is there a script to do it?

I was suggesting that you could figure out a way to convert the spaCy word vectors directly to the TensorBoard format. Something similar but not identical to the gist I posted (since it’s designed for gensim, not spaCy).
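
If a gensim-loadable file is still wanted, one possible route is the plain-text word2vec format: a header line with the entry count and dimensionality, then one "word v1 v2 ..." line per entry. The helper below is a hypothetical sketch (the name `write_word2vec_text` is made up), and it assumes the (word, vector) pairs have already been pulled out of the spaCy vocab. As far as I know, gensim can read such a file with `KeyedVectors.load_word2vec_format(path, binary=False)`, but treat that as something to verify against your gensim version.

```python
def write_word2vec_text(pairs, path):
    """Write (word, vector) pairs in the plain-text word2vec format.

    Empty words and words containing whitespace are skipped, because
    the format separates fields with spaces.
    Returns the number of entries written.
    """
    rows = [(word, [float(x) for x in vector])
            for word, vector in pairs
            if word and not any(ch.isspace() for ch in word)]
    dims = len(rows[0][1]) if rows else 0
    with open(path, "w", encoding="utf-8") as out:
        # Header line: "<number of entries> <vector dimensions>".
        out.write("{} {}\n".format(len(rows), dims))
        for word, vector in rows:
            out.write(word + " " + " ".join(repr(x) for x in vector) + "\n")
    return len(rows)
```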

Merged the two Tensorboard-related threads to keep things in one place!

I’m also interested in visualizing vectors in Tensorboard, so I made a spaCy compatible version of the Gist you linked. Thanks :clap:

spacy_vectors_to_tensorboard.py

#!/usr/bin/env python
# coding: utf8
"""Visualize spaCy word vectors in Tensorboard.

Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
"""
from __future__ import unicode_literals

import numpy as np
import os
import plac
import spacy
import tensorflow as tf
# Note: tf.contrib exists only in TensorFlow 1.x; it was removed in TensorFlow 2.0.
from tensorflow.contrib.tensorboard.plugins import projector


def visualize(model, output_path, tensor_name):
  vectors = model.vocab.vectors
  print('Initializing Tensorboard Vectors: {}'.format(vectors.shape))
  meta_file = "{}.tsv".format(tensor_name)
  placeholder = np.zeros(vectors.shape)

  with open(os.path.join(output_path, meta_file), 'wb') as file_metadata:
    for i, (key, vector) in enumerate(vectors.items()):
      placeholder[i] = vector
      if key not in model.vocab.strings:
        continue
      text = model.vocab[key].text
      # https://github.com/tensorflow/tensorflow/issues/9094
      file_metadata.write("{0}\n".format('<Space>' if text.lstrip() == '' else text).encode('utf-8'))

  # define the model without training
  sess = tf.InteractiveSession()

  embedding = tf.Variable(placeholder, trainable=False, name=tensor_name)
  tf.global_variables_initializer().run()

  saver = tf.train.Saver()
  writer = tf.summary.FileWriter(output_path, sess.graph)

  # adding into projector
  config = projector.ProjectorConfig()
  embed = config.embeddings.add()
  embed.tensor_name = tensor_name
  embed.metadata_path = meta_file

  # Write the projector config so TensorBoard can link the tensor to its metadata.
  projector.visualize_embeddings(writer, config)
  saver.save(sess, os.path.join(output_path, '{}.ckpt'.format(tensor_name)))


@plac.annotations(
  vectors_loc=("Path to spaCy model that contains word vectors", "positional", None, str),
  out_loc=("Path to output tensorboard vector visualization data", "positional", None, str),
  name=("Human readable name for tsv file and tensor name", "positional", None, str),
)
def main(vectors_loc, out_loc, name="spaCy_vectors"):
  print('Loading spaCy vectors model: {}'.format(vectors_loc))
  nlp = spacy.load(vectors_loc)
  print('Writing Tensorboard visualization: {}'.format(out_loc))
  visualize(nlp, out_loc, name)
  print('Done. Run `tensorboard --logdir={0}` to view in Tensorboard'.format(out_loc))


if __name__ == '__main__':
  plac.call(main)

Use like:

python spacy_vectors_to_tensorboard.py ./your-model/ ./output/spaCy-vectors/
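
One thing to watch in the script above: when a key is missing from the string store, its metadata line is skipped but its vector is still copied into the tensor, so the labels after that point no longer line up with the rows. A minimal sketch of keeping the two in step, using a hypothetical helper `aligned_rows` and a caller-supplied lookup function (both are illustrative, not part of spaCy):

```python
def aligned_rows(items, key_to_text):
    """Return (labels, vectors) with exactly one label per kept vector.

    `items` yields (key, vector) pairs; `key_to_text` returns the word for
    a key, or None when the key is unknown. Unknown keys are dropped from
    BOTH lists, so row i of the tensor matches line i of the metadata.
    """
    labels, vectors = [], []
    for key, vector in items:
        text = key_to_text(key)
        if text is None:
            continue  # skip the vector too, not just the label
        labels.append(text)
        vectors.append(vector)
    return labels, vectors
```

With spaCy, the lookup could be something like `lambda k: nlp.vocab.strings[k] if k in nlp.vocab.strings else None` (an assumption, not tested here). The tensor would then be built from the returned vectors rather than from a placeholder sized by `vectors.shape`.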

@justindujardin I am getting the following error. Did you get this error too?

@madhujahagirdar I didn't see the same problem, sorry.

It looks like maybe I messed up the lines above? If you follow the GitHub link in the code, the error you report is described there. I think the problem is that you have blank words in your vectors. They should be filled in with something (like <Empty Line>). I think if you update the code above to do that, it should work.

Maybe if you replace the lines I put above with this?

      # https://github.com/tensorflow/tensorflow/issues/9094
      file_metadata.write("{0}\n".format(text or '<Empty Line>').encode('utf-8'))

The following code worked for me:

    for i, (key, vector) in enumerate(vectors.items()):
      placeholder[i] = vector
      if key in model.vocab.strings:
        text = model.vocab[key].text

        # I added the lstrip and it worked
        if text.lstrip() == '':
          print("Empty line: it should be replaced by something else, or it will cause a bug in TensorBoard")
          file_metadata.write("{0}".format('<Empty Line>').encode('utf-8') + b'\n')
        else:
          file_metadata.write("{0}".format(text).encode('utf-8') + b'\n')
      else:
        print("The key",key,"is not in the vocab")
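
Both fixes above come down to never writing a blank metadata line. A small sketch that folds this into one helper (the name `metadata_label` is made up): it also collapses embedded tabs and newlines, on the assumption that these would otherwise split a label across columns or rows in the TSV file.

```python
def metadata_label(text, placeholder="<Empty Line>"):
    """Return a label that is safe to write as one line of metadata."""
    # Collapse all runs of whitespace (including tabs and newlines)
    # into single spaces and trim both ends.
    cleaned = " ".join(text.split())
    # Empty or whitespace-only tokens get a visible stand-in.
    return cleaned if cleaned else placeholder
```

The write in the loop would then become `file_metadata.write((metadata_label(text) + "\n").encode('utf-8'))`.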

I’m glad it worked for you! :+1: