Bus Error/Segmentation Fault - Custom Gensim Vectors

Hi,

I have trained my own word2vec model on a corpus of documents. I now want to create a blank spaCy model that uses those vectors. To do this, I've adapted code I've seen on this forum - see below…

Once I have my blank spaCy model with the custom vectors, I try to run ner.batch-train:
python -m prodigy ner.batch-train my_dataset en_core_apparel --output my_output --label MY_LABELS --eval-split 0.2 --n-iter 30 --batch-size 8

The recipe successfully loads the model and prints that it's using 20% of the accept/reject examples, and then it errors out with either a bus error or a segmentation fault. Any ideas what could be wrong? Am I adding my vectors correctly? My word2vec model is not large - the bin file is ~40MB, with 19351 entries and 19420 vectors.

Code I use to create a blank model:

from gensim import models
import spacy
import numpy as np
from prodigy.util import export_model_data
from spacy.lang.en import English
from spacy.pipeline import EntityRecognizer
from spacy.pipeline import SentenceSegmenter
from spacy.pipeline import DependencyParser
import logging


def pkl_to_bin(pkl_file, bin_file):
    logging.info("converting word2vec to bin file")
    word2vec = models.Word2Vec.load(pkl_file)
    # save_word2vec_format() writes word2vec text format by default (binary=False)
    word2vec.wv.save_word2vec_format(bin_file)


def create_blank_spacy(entities):
    logging.info("creating blank model")
    nlp = spacy.blank('en')
    # spacy.blank('en') already comes with an English tokenizer, so this line
    # isn't strictly needed
    tokenizer = English().Defaults.create_tokenizer(nlp)
    ner = EntityRecognizer(nlp.vocab)
    for entity in entities:
        ner.add_label(entity)

    nlp.add_pipe(ner)
    nlp.add_pipe(SentenceSegmenter(nlp.vocab))
    nlp.add_pipe(DependencyParser(nlp.vocab))
    return nlp


def add_vectors_to_model(spacy_model, w2v_bin_file):
    logging.info("adding vectors to blank model")
    # the file is word2vec text format, so it can be read line by line:
    # the first line is a "<rows> <cols>" header, the rest are "word v1 v2 ..."
    with open(w2v_bin_file, 'r') as f:
        for i, line in enumerate(f):
            if i == 0:
                rows, cols = line.split()
                spacy_model.vocab.reset_vectors(shape=(int(rows), int(cols)))
            else:
                word, *vec = line.split()
                vec = np.array([float(x) for x in vec])
                spacy_model.vocab.set_vector(word, vec)


def model_to_disk(model, model_name, model_outfile):
    logging.info("saving model to disk")
    # initialise the pipeline weights before saving to disk
    model.begin_training(lambda: [])
    model.meta['name'] = model_name
    model.to_disk(model_outfile)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    word2vec_model_file = "../apparel/models/w2v/product_word2vec.pkl"
    word2vec_bin_file = "../apparel/models/w2v/product_word2vec.bin"
    spacy_word2vec = "en_core_apparel"

    pkl_to_bin(word2vec_model_file, word2vec_bin_file)
    nlp = create_blank_spacy(['BRAND', 'PRODUCT_TYPE', 'SIZE', 'MATERIAL', 'AGE', 'COLOUR', 'GENDER'])
    add_vectors_to_model(nlp, word2vec_bin_file)
    model_to_disk(nlp, spacy_word2vec, spacy_word2vec)
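
In case it's useful, the quick sanity check I do on the saved model looks roughly like this (the word "shirt" is just an example from my own vocabulary):

import spacy

nlp = spacy.load("en_core_apparel")
print(nlp.pipe_names)           # which components ended up in the pipeline
print(nlp.vocab.vectors.shape)  # should match the (rows, cols) header of the word2vec file
print(nlp.vocab.has_vector("shirt"))
print(nlp.vocab.get_vector("shirt")[:5])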

A bit more info:

I am on spaCy version 2.0.11, running Python 3.6.1.

Thanks for the detailed example! Your code is also very readable, which makes things much easier…

We’ve been trying to track down a bug that causes segmentation faults in NER training across a few other issues, but I actually wonder whether you’re seeing a different issue here.

Does it still crash if you remove the parser and sentence segmenter? I’m particularly suspicious of the parser, as it would be an entirely blank model. Also, does it still crash if you don’t add the word vectors, but do everything else the same as you have here?
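
For the first test, something along these lines is what I mean - basically your create_blank_spacy with everything but the NER component removed (the function name here is just for illustration):

def create_blank_ner_only(entities):
    # same as create_blank_spacy, but only the NER component is added
    nlp = spacy.blank('en')
    ner = EntityRecognizer(nlp.vocab)
    for entity in entities:
        ner.add_label(entity)
    nlp.add_pipe(ner)
    return nlp

For the second test, keep create_blank_spacy exactly as it is and just skip the add_vectors_to_model(nlp, word2vec_bin_file) call before saving.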

Thanks Matthew.

You were right to be suspicious of the parser. When I comment out only nlp.add_pipe(DependencyParser(nlp.vocab)), I can create a blank model that then works fine in ner.batch-train, custom vectors and all. Hope that helps you. I'm guessing I haven't done enough reading to understand when a dependency parser is or isn't needed - I don't need it for what I'm doing anyway.
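
For reference, the pipeline setup that works for me now is simply:

nlp.add_pipe(ner)
nlp.add_pipe(SentenceSegmenter(nlp.vocab))
# nlp.add_pipe(DependencyParser(nlp.vocab))  # leaving out the blank parser avoids the crash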

Thanks! We’ll add a test to check that blank parsers don’t misbehave like this.