batch train buffer full

I ran textcat.batch-train with a custom model converted from gensim to spaCy on a dataset of 320k annotations (50% positive and 50% negative). It took a solid 24 hours to complete and then returned the error below before outputting the model :frowning:

Let me know how I should go about fixing this.

command:

nohup python -m prodigy textcat.batch-train followup_report_3M /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin --eval-split 0.2  -n 6 --dropout 0.2 --output followup_report_3M_model_PMC_PUB &

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/__main__.py", line 248, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/recipes/textcat.py", line 154, in batch_train
    nlp = nlp.from_bytes(best_model)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 680, in from_bytes
    msg = util.from_bytes(bytes_data, deserializers, {})
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 501, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/msgpack/fallback.py", line 122, in unpackb
    unpacker.feed(packed)
  File "/home/ubuntu/cnn-annotation/venv/lib/python3.5/site-packages/msgpack/fallback.py", line 291, in feed
    raise BufferFull
msgpack.exceptions.BufferFull

Loaded model /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin
Using 20% of examples (65254) for evaluation
Using 100% of remaining examples (261016) for training
Dropout: 0.2  Batch size: 10  Iterations: 6

#          LOSS       F-SCORE    ACCURACY
01         1006.154   0.968      0.968
02         823.834    0.968      0.968
03         816.626    0.968      0.968
04         807.455    0.969      0.969
05         799.661    0.970      0.969
06         794.458    0.970      0.970

MODEL      USER       COUNT
accept     accept     31645
accept     reject     1169
reject     reject     31636
reject     accept     804


Correct    63281
Incorrect  1973


Baseline   0.50
Precision  0.96
Recall     0.98
F-score    0.97
Accuracy   0.97

Is it because I used --output instead of --output-model?

No, that flag is correct. The problem is that the .to_bytes() can't handle the size of the vectors.

I think there's an underlying problem here that's behind the issues in the other thread as well: the vectors and vocab are unexpectedly large. Anyway, here's a quick fix so you can restart the training.

The problem occurs when we serialize the best current model to bytes and then try to load it back: msgpack complains that the data is too large. That's pretty troubling really -- but as a band-aid, we can tell it not to put the vectors into that byte representation. After all, the vectors are immutable, so there's no need to save them in the checkpoint.
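In isolation, the core of what the patched recipe below does is just this (a sketch of the relevant four lines, not a complete function):

vectors = nlp.vocab.vectors   # keep a reference to the (large) vectors table
nlp.vocab.vectors = None      # leave the vectors out of the byte representation
best_model = nlp.to_bytes()   # checkpoint the rest of the pipeline
nlp.vocab.vectors = vectors   # put the original table back and keep training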

The function below makes some small edits to the textcat.batch-train recipe to avoid saving the vectors. I've also added two break statements, so that the function exits after only one training batch. This should let you confirm that it completes correctly without waiting 24 hours (I feel your pain, believe me!).

@recipe('textcat.batch-train',
        dataset=recipe_args['dataset'],
        input_model=recipe_args['spacy_model'],
        output_model=recipe_args['output'],
        lang=recipe_args['lang'],
        factor=recipe_args['factor'],
        dropout=recipe_args['dropout'],
        n_iter=recipe_args['n_iter'],
        batch_size=recipe_args['batch_size'],
        eval_id=recipe_args['eval_id'],
        eval_split=recipe_args['eval_split'],
        long_text=("Long text", "flag", "L", bool),
        silent=recipe_args['silent'])
def batch_train(dataset, input_model=None, output_model=None, lang='en',
                factor=1, dropout=0.2, n_iter=10, batch_size=10,
                eval_id=None, eval_split=None, long_text=False, silent=False):
    """
    Batch train a new text classification model from annotations. Prodigy will
    export the best result to the output directory, and include a JSONL file of
    the training and evaluation examples. You can either supply a dataset ID
    containing the evaluation data, or choose to split off a percentage of
    examples for evaluation.
    """
    log("RECIPE: Starting recipe textcat.batch-train", locals())
    DB = connect()
    print_ = get_print(silent)
    random.seed(0)
    if input_model is not None:
        nlp = spacy.load(input_model, disable=['ner'])
        print_('\nLoaded model {}'.format(input_model))
    else:
        nlp = spacy.blank(lang, pipeline=[])
        print_('\nLoaded blank model')
    examples = DB.get_dataset(dataset)
    labels = {eg['label'] for eg in examples}
    labels = list(sorted(labels))
    model = TextClassifier(nlp, labels, long_text=long_text,
                           low_data=len(examples) < 1000)
    log('RECIPE: Initialised TextClassifier with model {}'
        .format(input_model), model.nlp.meta)
    random.shuffle(examples)
    if eval_id:
        evals = DB.get_dataset(eval_id)
        print_("Loaded {} evaluation examples from '{}'"
               .format(len(evals), eval_id))
    else:
        examples, evals, eval_split = split_evals(examples, eval_split)
        print_("Using {}% of examples ({}) for evaluation"
               .format(round(eval_split * 100), len(evals)))
    random.shuffle(examples)
    examples = examples[:int(len(examples) * factor)]
    print_(printers.trainconf(dropout, n_iter, batch_size, factor,
                              len(examples)))
    if len(evals) > 0:
        print_(printers.tc_update_header())
    best_acc = {'accuracy': 0}
    best_model = None
    if long_text:
        examples = list(split_sentences(nlp, examples, min_length=False))
    # Note the vectors in a variable, so we can unset them to serialize the
    # model. The vectors are immutable, so this works out pretty well.
    vectors = nlp.vocab.vectors
    for i in range(n_iter):
        loss = 0.
        random.shuffle(examples)
        for batch in cytoolz.partition_all(batch_size,
                                           tqdm.tqdm(examples, leave=False)):
            batch = list(batch)
            loss += model.update(batch, revise=False, drop=dropout)
            break  # NOTE: temporary -- only train on one batch, so the run finishes quickly
        if len(evals) > 0:
            with nlp.use_params(model.optimizer.averages):
                acc = model.evaluate(tqdm.tqdm(evals, leave=False))
                if acc['accuracy'] > best_acc['accuracy']:
                    best_acc = dict(acc)
                    # Avoid saving the vectors.
                    nlp.vocab.vectors = None
                    best_model = nlp.to_bytes()
                    nlp.vocab.vectors = vectors
            print_(printers.tc_update(i, loss, acc))
        break  # NOTE: temporary -- only run one iteration, so the run finishes quickly
    if len(evals) > 0:
        print_(printers.tc_result(best_acc))
    if output_model is not None:
        if best_model is not None:
            nlp = nlp.from_bytes(best_model)
            # Put the vectors back.
            nlp.vocab.vectors = vectors
        msg = export_model_data(output_model, nlp, examples, evals)
        print_(msg)
    return best_acc['accuracy']
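One way to run the patched version is to save the function into its own recipe file (together with the imports from the original prodigy/recipes/textcat.py) and point Prodigy at it with the -F flag, which should make the re-registered textcat.batch-train take precedence over the built-in one. For example, reusing your original command with a made-up file name patched_textcat.py:

nohup python -m prodigy textcat.batch-train followup_report_3M /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin --eval-split 0.2 -n 6 --dropout 0.2 --output followup_report_3M_model_PMC_PUB -F patched_textcat.py &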

I am getting the following error:

self._meta['vectors'] = {'width': self.vocab.vectors_length,
  File "vocab.pyx", line 250, in spacy.vocab.Vocab.vectors_length.__get__
AttributeError: 'NoneType' object has no attribute 'data'
Loaded model /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin
Using 20% of examples (65254) for evaluation
Using 100% of remaining examples (261016) for training
Dropout: 0.2  Batch size: 10  Iterations: 1

#          LOSS       F-SCORE    ACCURACY

Maybe it's not! I've been trying to track down the error above. Where's the _meta being set? Is that within Prodigy, spaCy, or your own code?

I was able to run the batch training; however, when I run text classification after the model is built, I run into the following error. I am unable to proceed with using Prodigy for my classification problems.

raise ShapeMismatchError(arg.shape, shape_values, shape)

thinc.exceptions.ShapeMismatchError:

Shape mismatch: input (2, 0) not compatible with [None, 200].

Any workaround for this? I am unable to run either terms.train or textcat.batch-train, and can't proceed further.

@honnibal, is there any workaround ?

@madhujahagirdar
Maybe try:


nlp.tagger.cfg['pretrained_dims'] = nlp.vocab.vectors.data.shape[1]
nlp.vocab.vectors = Vectors()

I get the following error now, and I also see that the vectors file size is 128 bytes:

-rw-rw-r-- 1 madhujahagirdar madhujahagirdar 128 Mar 12 09:01 vectors

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 117, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 159, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/language.py", line 638, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/language.py", line 634, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 604, in spacy.pipeline.Tagger.from_disk
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 586, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 500, in spacy.pipeline.Tagger.Model
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/_ml.py", line 442, in build_tagger_model
    pretrained_dims=pretrained_dims)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/_ml.py", line 272, in Tok2Vec
    glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 47, in __init__
    "Cannot create vectors table with dimension 0.\n"

I am still stuck and unable to train the models. I would really appreciate any workaround.

Hmm. Try:

nlp.tagger.cfg['pretrained_dims'] = nlp.vocab.vectors.data.shape[1]
nlp.vocab.vectors = Vectors(shape=(1, nlp.tagger.cfg['pretrained_dims']))

This should set it to the same shape as before, without serializing it.
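For reference, here's roughly where those two lines would go in the checkpoint step of the patched recipe above (an untested sketch; it assumes vectors still holds the original table saved earlier in the function):

from spacy.vectors import Vectors

# ... inside the training loop, at the point where the best model is serialized:
if acc['accuracy'] > best_acc['accuracy']:
    best_acc = dict(acc)
    # Remember the real width, then swap in a tiny placeholder table of the
    # same width so the byte representation stays small:
    nlp.tagger.cfg['pretrained_dims'] = vectors.data.shape[1]
    nlp.vocab.vectors = Vectors(shape=(1, nlp.tagger.cfg['pretrained_dims']))
    best_model = nlp.to_bytes()
    # Put the real table back for the next epoch:
    nlp.vocab.vectors = vectors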

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 117, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 159, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 638, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 634, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 604, in spacy.pipeline.Tagger.from_disk
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 587, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 588, in spacy.pipeline.Tagger.from_disk.load_model
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 351, in from_bytes
    dest = getattr(layer, name)
AttributeError: 'FunctionLayer' object has no attribute 'vectors'

I tried the above config and get the above error, and the vectors file size is 928 bytes:

-rw-rw-r-- 1 madhujahagirdar madhujahagirdar 928 Mar 15 13:06 vectors

@honnibal, finally I was able to make it work. Take a look at the code below and let me know if it's OK. I added deepcopy(vectors), as the vectors were getting reset when loading best_model. If it's OK, I have a path to continue now.

However, one clarification: since we are saving the vectors and putting them back, are the vectors updated with what the model has learned, or are they static from the word2vec model? Or should we move

nlp.vocab.vectors = vectors

to before nlp.vocab.vectors = Vectors(), so that we save the updated vectors?

    # Save the vectors (note: this excerpt also needs `from copy import deepcopy`
    # and `from spacy.vectors import Vectors` at the top of the recipe file)
    vectors = nlp.vocab.vectors
    print("length of vectors is ", len(vectors))
    for i in range(n_iter):
        loss = 0.
        random.shuffle(examples)
        for batch in cytoolz.partition_all(batch_size,
                                           tqdm.tqdm(examples, leave=False)):
            batch = list(batch)
            loss += model.update(batch, revise=False, drop=dropout)
        if len(evals) > 0:
            with nlp.use_params(model.optimizer.averages):
                acc = model.evaluate(tqdm.tqdm(evals, leave=False))
                if acc['accuracy'] > best_acc['accuracy']:
                    best_acc = dict(acc)
                    nlp.vocab.vectors = Vectors()
                    best_model = nlp.to_bytes()
                    nlp.vocab.vectors = vectors
            print_(printers.tc_update(i, loss, acc))
    if len(evals) > 0:
        print_(printers.tc_result(best_acc))
    if output_model is not None:
        if best_model is not None:
            # I had to do this, as nlp.from_bytes was resetting the vectors to 0 length. This works OK now.
            vectors_save = deepcopy(vectors)
            nlp = nlp.from_bytes(best_model)
            nlp.vocab.vectors = vectors_save
        msg = export_model_data(output_model, nlp, examples, evals)
        print_(msg)
    return best_acc['accuracy']

Glad we found a work-around! I hope we can fix the underlying problems in the next spaCy update.

The pre-trained vectors are static. The model separately has internal vectors which are learned from the data. It then concatenates the learned vectors with the static ones and condenses them with a hidden layer, to produce the output.

This means you don't have to worry about saving the vectors on each epoch -- so long as you put the static vectors back in, it should be fine. Problems will occur if you run with different vectors from the ones the model was trained with --- then the model will have different features, and you'll get bad results.
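If you want to double-check that, one rough way (the paths below are placeholders for your input model and the exported output directory) is to compare the vectors table of the input model with the one in the exported model; because the pre-trained vectors are static, they should come out identical:

import numpy
import spacy

# Placeholder paths -- substitute your own input model and output directory:
original = spacy.load('/path/to/PubMed-and-PMC-w2v-spacy.bin')
trained = spacy.load('/path/to/followup_report_3M_model_PMC_PUB')

# Expect matching shapes and unchanged (static) contents:
print(original.vocab.vectors.data.shape, trained.vocab.vectors.data.shape)
print(numpy.allclose(original.vocab.vectors.data, trained.vocab.vectors.data))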

Awesome ! Thanks for your support and patience.

@honnibal, this method worked for the single-label case. When I use multi-label classification, I run into the following error. Any idea what I need to change?

~/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable)
    339             if name in disable:
    340                 continue
--> 341             doc = proc(doc)
    342         return doc
    343 

nn_parser.pyx in spacy.syntax.nn_parser.Parser.__call__()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.parse_batch()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.get_batch_model()

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(seqs_in, drop)
    278         lengths = layer.ops.asarray([len(seq) for seq in seqs_in])
    279         X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad),
--> 280                                          drop=drop)
    281         if bp_layer is None:
    282             return layer.ops.unflatten(X, lengths, pad=pad), None

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in uniqued_fwd(X, drop)
    372                                                     return_counts=True)
    373         X_uniq = layer.ops.xp.ascontiguousarray(X[ind])
--> 374         Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
    375         Y = Y_uniq[inv].reshape((X.shape[0],) + Y_uniq.shape[1:])
    376         def uniqued_bwd(dY, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/layernorm.py in begin_update(self, X, drop)
     49 
     50     def begin_update(self, X, drop=0.):
---> 51         X, backprop_child = self.child.begin_update(X, drop=0.)
     52         N, mu, var = _get_moments(self.ops, X)
     53 

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/maxout.py in begin_update(self, X__bi, drop)
     67         W = self.W.reshape((self.nO * self.nP, self.nI))
     68         drop *= self.drop_factor
---> 69         output__boc = self.ops.batch_dot(X__bi, W)
     70         output__boc += self.b.reshape((self.nO*self.nP,))
     71         output__boc = output__boc.reshape((output__boc.shape[0], self.nO, self.nP))

ops.pyx in thinc.neural.ops.NumpyOps.batch_dot()

ValueError: shapes (8,512) and (640,384) not aligned: 512 (dim 1) != 640 (dim 0)

@honnibal Just to let you know that I get the same error with ner.teach. I have a custom w2v model, and I have tried it with two JSONL text datasets, one with approximately 15K abstracts and a smaller one with 2K. I then found this thread.

I am wondering what I can do to teach my custom label.

Here is the error I got:

$ prodigy ner.teach diseases_ner pubmed_word2vec journal_abstract_training_data_small.jsonl --label DISEASE --patterns diseases_terms.jsonl
Using 1 labels: DISEASE
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/prodigy/__main__.py", line 254, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 152, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 86, in teach
    model = EntityRecognizer(nlp, label=label)
  File "cython_src/prodigy/models/ner.pyx", line 160, in prodigy.models.ner.EntityRecognizer.__init__
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 273, in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "vectors.pyx", line 24, in spacy.vectors.unpickle_vectors
  File "vectors.pyx", line 428, in spacy.vectors.Vectors.from_bytes
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/spacy/util.py", line 490, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack/fallback.py", line 122, in unpackb
    unpacker.feed(packed)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack/fallback.py", line 291, in feed
    raise BufferFull
msgpack.exceptions.BufferFull

My spaCy model based on this word2vec model has around 4M+ words.

The same thing happens with ner.batch-train.