Vectors.from_glove expected file format

I’m trying to use the GloVe vectors from the Stanford page, but the formats offered there don’t seem to match what your from_glove helper expects. The function’s name suggests it’s set up to ingest GloVe vectors in their “standard” format, so I would expect it to consume the files that the GloVe homepage offers.

The Twitter vector files I downloaded are plain text, with one token followed by its space-separated float components per line, not the binary format with a separate vocab.txt file that spaCy wants.

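For context, here’s roughly how I expected to call it. The directory layout (a binary vectors file plus a vocab.txt) is just my reading of the v2.0 docs, so treat the file names as illustrative:

import spacy

nlp = spacy.blank('en')
# My understanding from the docs is that from_glove() wants a directory holding
# GloVe's *binary* output plus a vocab.txt, e.g. vectors.25.f.bin and vocab.txt.
# The Stanford downloads only ship a single .txt file, so this doesn't line up.
nlp.vocab.vectors.from_glove('/path/to/glove_binary_output/')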

Am I missing something?

GloVe’s training tool can write its output in either binary or text format, and the pre-trained vectors on the GloVe GitHub repo and website use the text format. Here’s the script I wrote to parse the text version and create a spaCy model with only a vocab:

#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using GloVe and exported in text file
format (one token per line, followed by its space-separated float components).

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals
import io

import numpy
import plac
import spacy
import tqdm


@plac.annotations(
    vectors_loc=("Path to GloVe pre-trained vectors .txt file", "positional", None, str),
    out_loc=("Path to output model that contains vectors", "positional", None, str),
    lang=("Optional language ID. If not set, 'en' will be used.", "positional", None, str),
)
def main(vectors_loc, out_loc, lang=None):
    if lang is None:
        lang = 'en'
    # Start from a blank pipeline so the output model contains only the vocab and vectors.
    nlp = spacy.blank(lang)
    print('Loading GloVe vectors: {}'.format(vectors_loc))
    # io.open so we can force utf8 decoding on both Python 2 and 3.
    with io.open(vectors_loc, 'r', encoding='utf8') as file_:
        lines = file_.readlines()
        print('Assigning {:,} spaCy vectors'.format(len(lines)))
        for line in tqdm.tqdm(lines, leave=False):
            # Each line is a token followed by its vector components.
            pieces = line.rstrip().split(' ')
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)
    print('Saving spaCy vector model: {}'.format(out_loc))
    nlp.to_disk(out_loc)
    print('Done.')


if __name__ == '__main__':
    plac.call(main)

It’s based on examples/vectors_fast_text.py from the spaCy repo.

Usage:

$ python glove_to_spacy.py glove.twitter.27B.25d.txt models/twitter-25/ 
Loading GloVe vectors: glove.twitter.27B.25d.txt
Assigning 1,193,514 spaCy vectors
Saving spaCy vector model: models/twitter-25/
Done.
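
To sanity-check the result, you can load the saved directory like any other spaCy model. A quick sketch (the path and example text are just placeholders):

import spacy

nlp = spacy.load('models/twitter-25')  # directory written by the script above
doc = nlp('just a quick vector check')
for token in doc:
    # has_vector / vector come straight from the vocab we populated
    print(token.text, token.has_vector, token.vector[:3])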