Tensor Board and spaCy/Gensim model

madhujahagirdar · January 27, 2018, 3:22am

Can I visualize the trained word2vec model in TensorBoard to understand the relationships between words?

ines · January 27, 2018, 10:10am

This is cool! By default, all models that Prodigy exports are spaCy model packages, so it won’t work out-of-the-box. However, I just had a look at the data the Embedding Projector expects, and we might be able to offer a converter or something like that for spaCy.

madhujahagirdar · March 3, 2018, 2:20pm

@ines Any chance this will come in the next release? Or if you can point me in some direction, I can work on this problem.

ines · March 3, 2018, 2:48pm

I haven’t really had much time yet to look at this in detail (and it’s not the highest on our list of priorities, since we want to focus on the core functionalities first). But if you want to give this a go, this would be super cool! I’m sure the community would love this, too.

The Embedding Projector shows the following examples of the input data:

Vectors (3 vectors of 4 dimensions):

0.1\t0.2\t0.5\t0.9
0.2\t0.1\t5.0\t0.2
0.4\t0.1\t7.0\t0.8

Metadata:

Pokémon\tSpecies
Wartortle\tTurtle
Venusaur\tSeed
Charmeleon\tFlame

So if you have a spaCy model with vectors, you could iterate over nlp.vocab.vectors.data (a numpy array) and generate the tab-separated format from it. Alternatively, you could iterate over the vocab, adding each entry in the vocabulary to the metadata file and its associated vector to the vectors file.
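
A minimal sketch of the second approach, assuming the word/vector pairs have already been pulled out of the model. The helper name `write_projector_tsv` and the file names are made up for illustration. With spaCy, the pairs could come from something like `((lex.text, lex.vector) for lex in nlp.vocab if lex.has_vector)`, but that part is not exercised here.

```python
import io
import os
import tempfile


def write_projector_tsv(pairs, vectors_path, metadata_path):
    """Write (word, vector) pairs as two line-aligned TSV files.

    Hypothetical helper: line i of the metadata file labels line i of the
    vectors file. Returns the number of rows written.
    """
    count = 0
    with io.open(vectors_path, "w", encoding="utf8") as vec_file, \
            io.open(metadata_path, "w", encoding="utf8") as meta_file:
        for word, vector in pairs:
            # Tabs or newlines inside a label would break the TSV layout,
            # and blank labels get a visible stand-in.
            label = " ".join(word.split()) or "<Empty>"
            vec_file.write("\t".join(str(float(x)) for x in vector) + "\n")
            meta_file.write(label + "\n")
            count += 1
    return count


# Toy data standing in for a real model's vocabulary.
example_pairs = [
    ("Wartortle", [0.1, 0.2, 0.5, 0.9]),
    ("Venusaur", [0.2, 0.1, 5.0, 0.2]),
    ("Charmeleon", [0.4, 0.1, 7.0, 0.8]),
]
out_dir = tempfile.mkdtemp()
vectors_path = os.path.join(out_dir, "vectors.tsv")
metadata_path = os.path.join(out_dir, "metadata.tsv")
rows_written = write_projector_tsv(example_pairs, vectors_path, metadata_path)
print("Wrote {} rows to {}".format(rows_written, out_dir))
```

The metadata file here has a single column and no header row. As far as I know, the projector only expects a header line (like the Pokémon/Species one above) when there are several columns.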

If you end up trying this, definitely keep us updated!

madhujahagirdar · March 3, 2018, 5:32pm

Sure, will give it a try over the weekend. Thanks.

madhujahagirdar · March 6, 2018, 2:59pm

Is there a way to convert a spaCy model to a gensim model? I want to load the model into TensorBoard for visualization. Currently I can do this with a gensim model, but I cannot upload a spaCy model to generate the visualization.
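
One possible route, sketched under assumptions: write the vectors out in the plain-text word2vec format, which gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` is designed to read. The helper name `write_word2vec_text` is made up, the input is a plain list of (word, vector) pairs rather than a real spaCy model, and the gensim loading step is not run here.

```python
import io
import os
import tempfile


def write_word2vec_text(pairs, path):
    """Write (word, vector) pairs in the plain-text word2vec format.

    Hypothetical helper. The first line holds "<count> <dimensions>", and each
    following line holds a word and its space-separated values.
    Returns the number of words written.
    """
    rows = []
    for word, vector in pairs:
        # The format is whitespace-delimited, so skip entries that are empty
        # or contain any whitespace.
        if word.split() != [word]:
            continue
        rows.append((word, [float(x) for x in vector]))
    dims = len(rows[0][1]) if rows else 0
    with io.open(path, "w", encoding="utf8") as out_file:
        out_file.write("{} {}\n".format(len(rows), dims))
        for word, values in rows:
            out_file.write("{} {}\n".format(word, " ".join(repr(v) for v in values)))
    return len(rows)


# Toy data; the middle entry contains a space and is skipped.
toy_pairs = [
    ("cat", [0.1, 0.2, 0.3]),
    ("ice cream", [0.4, 0.5, 0.6]),
    ("dog", [0.7, 0.8, 0.9]),
]
w2v_path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
words_written = write_word2vec_text(toy_pairs, w2v_path)
```

Skipping entries with whitespace is a design choice for this sketch. Replacing the whitespace with an underscore would be another option if those entries matter.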

beckerfuffle · March 1, 2018, 8:29pm

Looks like tensorboard support might be in the works:

If you're looking for a shorter-term solution, I bet you could figure out how to do it by modifying the code in this gist: Convert gensim word2vec to tensorboard visualized model, detail: https://eliyar.biz/using-pre-trained-gensim-word2vector-in-a-keras-model-and-visualizing/

madhujahagirdar · March 1, 2018, 8:32pm

I am already on it, but now I need to convert the spaCy model to gensim. Is there a script to do it?

beckerfuffle · March 6, 2018, 2:46pm

I was suggesting that you could figure out a way to convert the spacy word vectors directly to tensorboard format. Something similar but not identical to the gist I posted (since it’s designed for gensim not spacy).

ines · March 6, 2018, 3:00pm

Merged the two Tensorboard-related threads to keep things in one place!

justindujardin · March 21, 2018, 5:58pm

I’m also interested in visualizing vectors in Tensorboard, so I made a spaCy-compatible version of the Gist you linked. Thanks!

spacy_vectors_to_tensorboard.py

#!/usr/bin/env python
# coding: utf8
"""Visualize spaCy word vectors in Tensorboard.

Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507
"""
from __future__ import unicode_literals

import numpy as np
import os
import plac
import spacy
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector


def visualize(model, output_path, tensor_name):
  vectors = model.vocab.vectors
  print('Initializing Tensorboard Vectors: {}'.format(vectors.shape))
  meta_file = "{}.tsv".format(tensor_name)
  placeholder = np.zeros(vectors.shape)

  with open(os.path.join(output_path, meta_file), 'wb') as file_metadata:
    for i, (key, vector) in enumerate(vectors.items()):
      placeholder[i] = vector
      if key in model.vocab.strings:
        text = model.vocab[key].text
      else:
        # Keep one metadata line per vector row so labels stay aligned.
        text = '<Unknown>'
      # https://github.com/tensorflow/tensorflow/issues/9094
      file_metadata.write("{0}\n".format('<Space>' if text.lstrip() == '' else text).encode('utf-8'))

  # define the model without training
  sess = tf.InteractiveSession()

  embedding = tf.Variable(placeholder, trainable=False, name=tensor_name)
  tf.global_variables_initializer().run()

  saver = tf.train.Saver()
  writer = tf.summary.FileWriter(output_path, sess.graph)

  # adding into projector
  config = projector.ProjectorConfig()
  embed = config.embeddings.add()
  embed.tensor_name = tensor_name
  embed.metadata_path = meta_file

  # Write the projector config and save the checkpoint that holds the embedding.
  projector.visualize_embeddings(writer, config)
  saver.save(sess, os.path.join(output_path, '{}.ckpt'.format(tensor_name)))


@plac.annotations(
  vectors_loc=("Path to spaCy model that contains word vectors", "positional", None, str),
  out_loc=("Path to output tensorboard vector visualization data", "positional", None, str),
  name=("Human readable name for tsv file and tensor name", "positional", None, str),
)
def main(vectors_loc, out_loc, name="spaCy_vectors"):
  print('Loading spaCy vectors model: {}'.format(vectors_loc))
  nlp = spacy.load(vectors_loc)
  print('Writing Tensorboard visualization: {}'.format(out_loc))
  visualize(nlp, out_loc, name)
  print('Done. Run `tensorboard --logdir={0}` to view in Tensorboard'.format(out_loc))


if __name__ == '__main__':
  plac.call(main)

Use like:

python spacy_vectors_to_tensorboard.py ./your-model/ ./output/spaCy-vectors/

madhujahagirdar · March 21, 2018, 9:38pm

@justindujardin I am getting the following error. Did you get this error?

justindujardin · March 21, 2018, 9:52pm

@madhujahagirdar I didn't see the same problem, sorry.

It looks like maybe I messed up the lines above? If you follow the link to GitHub in the code, the error you report is described there. I think the problem is that you have blank words in your vectors. They should be filled in with something (like <Empty Line>). I think if you update the code above to do that, it should work.

Maybe if you replace the lines I put above with this?

      # https://github.com/tensorflow/tensorflow/issues/9094
      file_metadata.write("{0}\n".format(text or '<Empty Line>').encode('utf-8'))
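
A small sketch of the same idea as a standalone helper. The name `projector_label` and its `placeholder` argument are made up here. It also covers whitespace-only entries such as `' '` or `'\n'`, which `text or '<Empty Line>'` alone would let through, because a string containing only spaces is still truthy.

```python
def projector_label(text, placeholder="<Empty Line>"):
    """Return a metadata label that is safe to write as one TSV line."""
    # Collapse tabs, newlines and repeated spaces into single spaces;
    # whitespace-only text collapses to the empty string.
    cleaned = " ".join(text.split())
    return cleaned or placeholder


labels = [projector_label(t) for t in ["apple", "", " ", "\n", "new\tline"]]
print(labels)
```

Every vector still gets exactly one label this way, which matters because the metadata lines are matched to the tensor rows by position.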

madhujahagirdar · March 21, 2018, 10:05pm

The following code worked for me:

    for i, (key, vector) in enumerate(vectors.items()):
      placeholder[i] = vector
      if key in model.vocab.strings:
        text = model.vocab[key].text

        # Adding the lstrip() check made it work
        if text.lstrip() == '':
          print("Empty line; it must be replaced with something else, or it will cause a bug in TensorBoard")
          file_metadata.write("{0}".format('<Empty Line>').encode('utf-8') + b'\n')
        else:
          file_metadata.write("{0}".format(text).encode('utf-8') + b'\n')
      else:
        print("The key",key,"is not in the vocab")

justindujardin · March 21, 2018, 10:27pm

I’m glad it worked for you!
