I’m trying to use GloVe vectors from the Stanford page, and the formats on their site don’t seem to match your from_glove
helper. The name of the function suggests it’s set up to ingest GloVe vectors in their “standard” format, so I’d expect it to consume files in the format the GloVe homepage offers.
Here’s a peek at the twitter vector files I downloaded, which are plain text rather than the binary format with a separate vocab.txt file that spaCy wants.
Am I missing something?
GloVe has several output format options, including binary and text. The pre-trained vectors on their GitHub repo and website are in text format. Here’s the script I wrote to parse the text version and create a spaCy model containing only a vocab:
#!/usr/bin/env python
# coding: utf8
"""Load vectors for a language trained using GloVe and exported in text file format.

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals

import io

import numpy
import plac
import spacy
import tqdm


@plac.annotations(
    vectors_loc=("Path to GloVe pre-trained vectors .txt file", "positional", None, str),
    out_loc=("Path to output model that contains vectors", "positional", None, str),
    lang=("Optional language ID. If not set, 'en' will be used.", "positional", None, str),
)
def main(vectors_loc, out_loc, lang=None):
    if lang is None:
        lang = 'en'
    nlp = spacy.blank(lang)
    print('Loading GloVe vectors: {}'.format(vectors_loc))
    # io.open with an explicit encoding works on both Python 2 and 3;
    # the GloVe files are UTF-8.
    with io.open(vectors_loc, 'r', encoding='utf8') as file_:
        lines = file_.readlines()
    print('Assigning {:,} spaCy vectors'.format(len(lines)))
    for line in tqdm.tqdm(lines, leave=False):
        # Each line is the token followed by its space-separated components;
        # rstrip() drops the trailing newline before splitting.
        pieces = line.rstrip().split(' ')
        word = pieces[0]
        vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
        nlp.vocab.set_vector(word, vector)
    print('Saving spaCy vector model: {}'.format(out_loc))
    nlp.to_disk(out_loc)
    print('Done.')


if __name__ == '__main__':
    plac.call(main)
It’s based on examples/vectors_fast_text.py from the spaCy repo.
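For reference, each line of the GloVe text format is just the token followed by its vector components, space-separated, which is what the split-and-parse step in the script relies on. A minimal sketch (the token and values here are made up for illustration):

```python
# One line of the GloVe text format: token, then the vector components,
# all space-separated. Real files have 25+ components; this is a toy line.
line = "cat 0.1 -0.2 0.3\n"

pieces = line.rstrip().split(' ')
word = pieces[0]
vector = [float(v) for v in pieces[1:]]

print(word, vector)  # cat [0.1, -0.2, 0.3]
```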
Usage:
$ python glove_to_spacy.py glove.twitter.27B.25d.txt models/twitter-25/
Loading GloVe vectors: glove.twitter.27B.25d.txt
Assigning 1,193,514 spaCy vectors
Saving spaCy vector model: models/twitter-25/
Done.
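The saved directory is a regular spaCy model that spacy.load can read back. A minimal sketch of the round trip, using a throwaway vocab-only model in a temp directory instead of the real twitter vectors (the token 'coffee' and the 25-dimensional size are just example choices matching glove.twitter.27B.25d):

```python
import tempfile

import numpy
import spacy

# Build a tiny vocab-only model standing in for the script's output,
# with one token given a 25-dimensional vector.
nlp = spacy.blank('en')
nlp.vocab.set_vector('coffee', numpy.ones(25, dtype='f'))

out_loc = tempfile.mkdtemp()
nlp.to_disk(out_loc)

# Load it back the same way you would load models/twitter-25/.
nlp2 = spacy.load(out_loc)
token = nlp2('coffee')[0]
print(token.has_vector, token.vector.shape)
```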