batch train buffer full

@honnibal, is there any workaround?

@madhujahagirdar
Maybe try:


from spacy.vectors import Vectors

nlp.tagger.cfg['pretrained_dims'] = nlp.vocab.vectors.data.shape[1]
nlp.vocab.vectors = Vectors()

I get the following error now, and I also see that the vectors file size is 128 bytes:

-rw-rw-r-- 1 madhujahagirdar madhujahagirdar 128 Mar 12 09:01 vectors

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 117, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 159, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/language.py", line 638, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/language.py", line 634, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 604, in spacy.pipeline.Tagger.from_disk
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 586, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 500, in spacy.pipeline.Tagger.Model
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/_ml.py", line 442, in build_tagger_model
    pretrained_dims=pretrained_dims)
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/spacy/_ml.py", line 272, in Tok2Vec
    glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
  File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 47, in __init__
    "Cannot create vectors table with dimension 0.\n"

I am still stuck and unable to train the models; I would really appreciate any workaround.

Hmm. Try:

from spacy.vectors import Vectors

nlp.tagger.cfg['pretrained_dims'] = nlp.vocab.vectors.data.shape[1]
nlp.vocab.vectors = Vectors(shape=(1, nlp.tagger.cfg['pretrained_dims']))

This should set it to the same shape as before, without serializing it.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 117, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 159, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 638, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 634, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 604, in spacy.pipeline.Tagger.from_disk
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 587, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 588, in spacy.pipeline.Tagger.from_disk.load_model
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 351, in from_bytes
    dest = getattr(layer, name)
AttributeError: 'FunctionLayer' object has no attribute 'vectors'

I tried the above config and got the above error, and the vectors file size is 928 bytes:

-rw-rw-r-- 1 madhujahagirdar madhujahagirdar 928 Mar 15 13:06 vectors

@honnibal, finally I was able to make it work. Look at the code below and let me know if it's OK. I added deepcopy(vectors) because the vectors were getting reset when loading best_model. If it's OK, I now have a path to continue.

However, one clarification: as we are saving the vectors and putting them back, are the vectors updated with what the model has learned, or are they static from the word2vec model? Or should we move

nlp.vocab.vectors = vectors

before

nlp.vocab.vectors = Vectors()

so that we save the updated vectors?

    # Save the vectors (needs `from copy import deepcopy` and `from spacy.vectors import Vectors`)
    vectors = nlp.vocab.vectors
    print("length of vectors is ", len(vectors))
    for i in range(n_iter):
        loss = 0.
        random.shuffle(examples)
        for batch in cytoolz.partition_all(batch_size,
                                           tqdm.tqdm(examples, leave=False)):
            batch = list(batch)
            loss += model.update(batch, revise=False, drop=dropout)
        if len(evals) > 0:
            with nlp.use_params(model.optimizer.averages):
                acc = model.evaluate(tqdm.tqdm(evals, leave=False))
                if acc['accuracy'] > best_acc['accuracy']:
                    best_acc = dict(acc)
                    # Swap in an empty vectors table so the big vectors aren't serialized
                    nlp.vocab.vectors = Vectors()
                    best_model = nlp.to_bytes()
                    nlp.vocab.vectors = vectors
            print_(printers.tc_update(i, loss, acc))
    if len(evals) > 0:
        print_(printers.tc_result(best_acc))
    if output_model is not None:
        if best_model is not None:
            # I had to do this, as nlp.from_bytes() was resetting the vectors to 0 length. This works OK now.
            vectors_save = deepcopy(vectors)
            nlp = nlp.from_bytes(best_model)
            nlp.vocab.vectors = vectors_save
        msg = export_model_data(output_model, nlp, examples, evals)
        print_(msg)
    return best_acc['accuracy']

Glad we found a work-around! I hope we can fix the underlying problems in the next spaCy update.

The pre-trained vectors are static. The model separately has internal vectors which are learned from the data. It then concatenates the learned vectors with the static ones and condenses them with a hidden layer, to produce the output.

This means you don't have to worry about saving the vectors on each epoch -- so long as you put the static vectors back in, it should be fine. Problems will occur if you run with different vectors from the ones the model was trained with: then the model will have different features, and you'll get bad results.
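In case it helps others reading along, a minimal sketch of that pattern (assuming spaCy 2.x and an nlp object whose vocab holds large static vectors; the variable names are just illustrative):

from spacy.vectors import Vectors

vectors = nlp.vocab.vectors      # keep a reference to the static vectors
nlp.vocab.vectors = Vectors()    # swap in an empty table so the big vectors aren't serialized
best_model = nlp.to_bytes()      # serialize the pipeline without the vectors
nlp.vocab.vectors = vectors      # put the same static vectors back before training continues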

Awesome! Thanks for your support and patience.

@honnibal, this method worked for single-label classification. When I use multi-label classification, I run into the following error. Any idea what I need to change?

~/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable)
    339             if name in disable:
    340                 continue
--> 341             doc = proc(doc)
    342         return doc
    343 

nn_parser.pyx in spacy.syntax.nn_parser.Parser.__call__()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.parse_batch()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.get_batch_model()

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(seqs_in, drop)
    278         lengths = layer.ops.asarray([len(seq) for seq in seqs_in])
    279         X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad),
--> 280                                          drop=drop)
    281         if bp_layer is None:
    282             return layer.ops.unflatten(X, lengths, pad=pad), None

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in uniqued_fwd(X, drop)
    372                                                     return_counts=True)
    373         X_uniq = layer.ops.xp.ascontiguousarray(X[ind])
--> 374         Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
    375         Y = Y_uniq[inv].reshape((X.shape[0],) + Y_uniq.shape[1:])
    376         def uniqued_bwd(dY, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/api.py in begin_update(self, X, drop)
     59         callbacks = []
     60         for layer in self._layers:
---> 61             X, inc_layer_grad = layer.begin_update(X, drop=drop)
     62             callbacks.append(inc_layer_grad)
     63         def continue_update(gradient, sgd=None):

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/layernorm.py in begin_update(self, X, drop)
     49 
     50     def begin_update(self, X, drop=0.):
---> 51         X, backprop_child = self.child.begin_update(X, drop=0.)
     52         N, mu, var = _get_moments(self.ops, X)
     53 

~/cnn-annotation/venv/lib/python3.5/site-packages/thinc/neural/_classes/maxout.py in begin_update(self, X__bi, drop)
     67         W = self.W.reshape((self.nO * self.nP, self.nI))
     68         drop *= self.drop_factor
---> 69         output__boc = self.ops.batch_dot(X__bi, W)
     70         output__boc += self.b.reshape((self.nO*self.nP,))
     71         output__boc = output__boc.reshape((output__boc.shape[0], self.nO, self.nP))

ops.pyx in thinc.neural.ops.NumpyOps.batch_dot()

ValueError: shapes (8,512) and (640,384) not aligned: 512 (dim 1) != 640 (dim 0)

@honnibal Just to let you know that I have the same error with ner.teach. I have a custom w2v model, and I have tried it with two JSONL text datasets: one with approximately 15K abstracts and a smaller one with 2K. I then found this thread.

I am wondering what I can do to teach my custom label.

Here is the error I got:

$ prodigy ner.teach diseases_ner pubmed_word2vec journal_abstract_training_data_small.jsonl --label DISEASE --patterns diseases_terms.jsonl
Using 1 labels: DISEASE
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/prodigy/__main__.py", line 254, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 152, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 86, in teach
    model = EntityRecognizer(nlp, label=label)
  File "cython_src/prodigy/models/ner.pyx", line 160, in prodigy.models.ner.EntityRecognizer.__init__
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 273, in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "vectors.pyx", line 24, in spacy.vectors.unpickle_vectors
  File "vectors.pyx", line 428, in spacy.vectors.Vectors.from_bytes
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/spacy/util.py", line 490, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack/fallback.py", line 122, in unpackb
    unpacker.feed(packed)
  File "/Users/samuel/Projects/prodigy/.env/lib/python3.6/site-packages/msgpack/fallback.py", line 291, in feed
    raise BufferFull
msgpack.exceptions.BufferFull

My spaCy model based on this word2vec model has around 4M+ words.

The same thing happens with ner.batch-train.

I have tried the few suggestions above in ner.batch-train, but it does not work:

    # test saving vectors
    vectors = model.vocab.vectors
    for i in range(n_iter):
        losses = model.batch_train(examples, batch_size=batch_size,
                                   drop=dropout, beam_width=beam_width)
        stats = model.evaluate(evals)
        if best is None or stats['acc'] > best[0]:
            model_to_bytes = None
            if output_model is not None:
                # test removing vectors
                model.vocab.vectors = None
                model_to_bytes = model.to_bytes()
                # test adding them back
                model.vocab.vectors = vectors
            best = (stats['acc'], stats, model_to_bytes)
        print_(printers.ner_update(i, losses, stats))
    best_acc, best_stats, best_model = best
    print_(printers.ner_result(best_stats, best_acc, baseline['acc']))
    if output_model is not None:
        # test deep copy trick
        vectors_save = copy.deepcopy(vectors)
        model.from_bytes(best_model)
        # test putting them back
        model.vocab.vectors = vectors_save
        msg = export_model_data(output_model, model.nlp, examples, evals)
        print_(msg)
    best_stats['baseline'] = baseline['acc']
    best_stats['acc'] = best_acc
    return best_stats

I did not see how to modify the ner.teach recipe yet. Looking.

@idealley It looks like the problem is the deserialization. I didn’t expect to be running into a data size limit on msgpack, since it’s “only” a few GB. But, here we are :(.

The error is coming when we call model.to_bytes(). We serialize to a byte stream rather than a directory here because we want to avoid making unnecessary writes to disk. We can work around the situation by replacing this with a call to model.to_disk() and the matching load with model.from_disk().
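A rough sketch of that workaround against the recipe code above, assuming the model object exposes to_disk()/from_disk() as described (the temporary-directory handling here is just illustrative):

import tempfile
from pathlib import Path

tmp_dir = Path(tempfile.mkdtemp())
best_path = tmp_dir / 'best_model'
model.to_disk(best_path)      # instead of: model_to_bytes = model.to_bytes()
# ... later, when restoring the best model:
model.from_disk(best_path)    # instead of: model.from_bytes(best_model)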

Ultimately this is a spaCy issue: I need to come up with a different deserialization strategy for the word vectors. What implementation did you use to train them? Would it be possible for you to apply a vocabulary limit, e.g. restricting to 1 or 2 million entries? It might also help to pre-process the text more carefully, as pre-processing artifacts can make the vocabulary much more sparse.
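If a vocabulary limit is acceptable, one option (an assumption on my side, not something tested in this thread) is gensim's limit argument when loading the .bin file, which keeps only the first N entries:

from gensim.models import KeyedVectors

# Placeholder path; keep only the first 2 million vectors.
w2v = KeyedVectors.load_word2vec_format('your_vectors.bin', binary=True, limit=2000000)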

There are definitely situations where 4 million vectors are desirable, though, e.g. if you're using vectors for longer phrases. So we do want to get this fixed in spaCy.

@honnibal, I did not try using the disk yet. I will do that after this message.

I have read a lot about NLP and similar tasks, but I have not practiced much yet. So, to be sure I am doing things more or less right, I will just summarise what I am doing. I used these w2v models: http://bio.nlplab.org/, which can be downloaded here: http://evexdb.org/pmresources/vec-space-models/

I have also tried with a smaller PubMed w2v bin. I have transformed them both into spaCy models as follows:

from gensim.models import KeyedVectors
import spacy

w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
# w2v = KeyedVectors.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
nlp = spacy.load("en_core_web_sm", vectors=False)

for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))

nlp.to_disk('pubmed_w2v')
# nlp.to_disk('wp_pubmed_pmc_w2v') 

The first folder that was generated is 2.3 GB, the other one 5 GB.

I have prepared a list of patterns that contains abbreviations and single- and multi-word diseases:

{"label":"DISEASE","pattern":[{"lower":"asthma"}]}
{"label":"DISEASE","pattern":[{"lower":"acute"},{"lower":"bronchitis"}]}
{"label":"DISEASE","pattern":[{"lower":"acute"},{"lower":"respiratory"},{"lower":"distress"},{"lower":"syndrome"}]}
{"label":"DISEASE","pattern":[{"lower":"ards"}]}

I have not yet tried to use shapes for the abbreviations, as per @ines' suggestion.

I then tried to annotate a text of 15k medical abstracts:

{"text": "Severe chronic obstructive pulmonary disease (COPD) is a progressive and debilitating illness characterised by relentless loss of function, intensifying dyspnoea and frequent exacerbations. COPD patients are evidently at increased risk of depression, frailty and death [1, 2]. Predicting individual short-term prognosis and course of events is difficult if not impossible.\n\nAdvance care planning should be part of our clinical routine in severe COPD <http://ow.ly/Cshs30i8FS9>"}
{"text": "The management of idiopathic pulmonary fibrosis (IPF) is complex, as is the process of implementing and assessing a set of quality indicators representing best care practices in IPF by an interstitial lung disease (ILD) programme [1, 2]. To date, there is limited literature documenting the importance of IPF interventions to improve coordination of care, patient engagement in health literacy and education, and understanding what is important to patients [3\u20138]. In 2015, National Jewish Health (NJH) engaged our ILD division healthcare professionals (10 physicians, 4 nurses, 2 medical assistants, 1 physician assistant) and our professional education and biostatistics teams to design and implement a project aimed at measuring key quality indicators and how they may impact clinical practice and IPF patient perception of care.\n\nA successful initiative to improve best care practice in IPF supported by electronic medical record changes <http://ow.ly/ORxi30hBEmy>\n\nThe authors are grateful for the support provided by the interstitial lung disease team at National Jewish Health."}

I ran the following command:

prodigy ner.teach diseases_ner pubmed_w2v journal_abstract_training_data.jsonl --label DISEASE --patterns diseases_terms.jsonl

or with the bigger model; both of them throw the buffer exception. (By the way, the ner.teach recipe does not make direct use of the to_bytes() method; it is the EntityRecognizer at line 86, and I do not know how to override this one as I cannot read the source. Or can I?)

Then I tried that command with the en, en_core_web_sm and en_core_web_lg models.

This seems to work a little, as my diseases and abbreviations are matched really well. The problem here is that, in the best case, I could only do around 80 examples and got as far as 43% in the progress bar. Then Prodigy tells me that there are no more examples. If I restart, I get the same examples (I tried many times; I kind of recognise the articles Prodigy shows me by now). But anyway, I tried to move forward and did a batch train:

prodigy ner.batch-train diseases_ner_test3 pubmed_w2v --output diseases --label DISEASE --eval-split 0.2 --n-iter 8 --batch-size 6

As you suggest in the video, I also increased the batch size as I saw that I could train a little more, but I have very few examples. I get things like:

Loaded model en_core_web_lg
Using 20% of accept/reject examples (7) for evaluation
Using 100% of remaining examples (29) for training
Dropout: 0.2  Batch size: 8  Iterations: 8


BEFORE     0.000
Correct    0
Incorrect  7
Entities   14
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         11.629     0          7          13         0          0.000
02         8.879      1          6          18         0          0.143
03         10.857     3          4          20         0          0.429
04         7.192      3          4          18         0          0.429
05         8.207      4          3          17         0          0.571
06         6.249      5          2          26         0          0.714
07         5.691      6          1          21         0          0.857
08         4.769      6          1          27         0          0.857

Correct    6
Incorrect  1
Baseline   0.000
Accuracy   0.857

The accuracy is indeed not bad. I have given some text to the spaCy NER and it does match my diseases, but the NER model is quite broken, as "and", "the" and other words are labelled as WORK_OF_ART etc.

I have noticed that the en_core models have 300-dimensional vectors, while those I downloaded have 200; would that make a difference? Did I do something wrong? Thank you for your help!

Sam

I have tried nlp.to_disk(); it worked a little longer, then the EntityRecognizer was called and it threw the buffer exception.

What would be the best approach to have a working prototype for a new label such as disease?

Any news on that?

Just to let you know, I managed to train new entities with the PubMed w2v. What I did was:

  1. I used en_core_web_lg to train new entities with a list of medical texts.
  2. I did all the steps, exported a model and used it to train gold entities.
  3. I loaded the binary w2v model with gensim and saved it as a text file (see the sketch after this list).
  4. I used spacy init-model with the saved text file.
  5. I used the gold entities to train the model, basically with the spaCy batch-train example.
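Roughly, steps 3 and 4 look like this (paths are placeholders; the init-model call is the spaCy 2 CLI as far as I understand it):

from gensim.models import KeyedVectors

# Step 3: load the binary w2v model and re-save it in word2vec text format.
w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
w2v.save_word2vec_format('pubmed_w2v.txt', binary=False)

# Step 4 (run in the shell): build a fresh spaCy model from the text vectors, e.g.
#   python -m spacy init-model en pubmed_w2v_model --vectors-loc pubmed_w2v.txt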

I did the same with en_core_web_lg, and with both I get very encouraging results. With more training data I am sure I can really improve, as I see some discrepancies between the two differently trained models.

I am thinking of writing a blog post about it. Would that interest anyone?


@idealley Thanks for updating, and sorry I missed this thread! I actually suspect you might have encountered a bug in a previous version. Are you currently using v1.5.1?

That sounds like a good workflow. One difficult question is always whether to recommend training on top of an existing NER model (such as en_core_web_lg), or whether to recommend starting from a blank one. The existing model might know useful things, but on the other hand it can also be stubborn about the existing entity definitions, and the training data might not correct them. For instance, I think this is why you had that problem with a rare category like WORK_OF_ART. If you never label examples with that label, the model never sees any negative examples of it, so it’s hard for it to learn not to predict it.