batch train buffer full

I have tried the few propositions above in ner.batch-train, but it does not work:

    import copy  # needed for the deepcopy test below

    # test saving vectors
    vectors = model.vocab.vectors
    for i in range(n_iter):
        losses = model.batch_train(examples, batch_size=batch_size,
                                   drop=dropout, beam_width=beam_width)
        stats = model.evaluate(evals)
        if best is None or stats['acc'] > best[0]:
            model_to_bytes = None
            if output_model is not None:
                # test removing vectors
                model.vocab.vectors = None
                model_to_bytes = model.to_bytes()
                # test adding them back
                model.vocab.vectors = vectors
            best = (stats['acc'], stats, model_to_bytes)
        print_(printers.ner_update(i, losses, stats))
    best_acc, best_stats, best_model = best
    print_(printers.ner_result(best_stats, best_acc, baseline['acc']))
    if output_model is not None:
        # test deep copy trick
        vectors_save = copy.deepcopy(vectors)
        # test putting them back
        model.vocab.vectors = vectors_save
        msg = export_model_data(output_model, model.nlp, examples, evals)
    best_stats['baseline'] = baseline['acc']
    best_stats['acc'] = best_acc
    return best_stats

I did not see how to modify the ner.teach method yet. Looking.

@idealley It looks like the problem is the deserialization. I didn’t expect to be running into a data size limit on msgpack, since it’s “only” a few GB. But, here we are :(.

The error is coming when we call model.to_bytes(). We serialize to a byte-stream rather than a directory here because we want to avoid making unnecessary writes to disk. We can work around the situation by replacing this with a call to model.to_disk() and the matching load with model.from_disk().
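A minimal sketch of that workaround (the helper name and the temp-directory handling are illustrative, not Prodigy's actual code):

```python
import tempfile
from pathlib import Path

def roundtrip_via_disk(model):
    """Serialize a model through a temporary directory instead of a single
    msgpack byte-string, side-stepping the buffer limit on multi-GB payloads."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "best_model"
        model.to_disk(path)    # instead of model_bytes = model.to_bytes()
        model.from_disk(path)  # instead of model.from_bytes(model_bytes)
```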

Ultimately this is a spaCy issue: I need to come up with a different deserialization strategy for the word vectors. What implementation did you use to train them? Would it be possible for you to apply a vocabulary limit, e.g. restricting to 1 or 2 million entries? It might also help to pre-process the text more carefully, as pre-processing artifacts can make the vocabulary much more sparse.
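For the vocabulary limit, gensim's loader already supports this: `KeyedVectors.load_word2vec_format` accepts a `limit` argument that reads only the first N entries (word2vec files are normally sorted by corpus frequency). A rough sketch, with the same idea written out for a text-format vectors file:

```python
# With gensim (reads only the first million entries):
#   w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin',
#                                           binary=True, limit=1_000_000)

def trim_word2vec_txt(src, dst, limit):
    """Keep only the first `limit` entries of a text-format word2vec file.
    The header line is '<n_words> <dim>'; each following line is one entry."""
    with open(src) as fin, open(dst, "w") as fout:
        n_words, dim = fin.readline().split()
        keep = min(int(n_words), limit)
        fout.write(f"{keep} {dim}\n")
        for _ in range(keep):
            fout.write(fin.readline())
```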

There are definitely situations where 4 million vectors are desired though, e.g. if you’re using vectors for longer phrases. So we do want to get this fixed in spaCy.

@honnibal, I did not try using the disk yet. I will do that after this message.

I have read a lot about NLP and similar tasks, but I have not practised much yet. So, to be sure I am doing things more or less right, I will just summarise what I am doing. I used the w2v models that can be downloaded here

I have also tried with a smaller PubMed w2v binary. I transformed them both into spaCy models as follows:

from gensim.models import KeyedVectors
import spacy

# load_word2vec_format returns a KeyedVectors object directly
w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
# w2v = KeyedVectors.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
nlp = spacy.load("en_core_web_sm", vectors=False)

# copy every word vector into the spaCy vocab
for word in w2v.vocab:
    nlp.vocab.set_vector(word, w2v.word_vec(word))

# nlp.to_disk('wp_pubmed_pmc_w2v')

The first folder that was generated is 2.3 GB, the other one 5 GB.

I have prepared a list of patterns that contains abbreviations and single- and multi-word diseases.
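For example, a diseases_terms.jsonl in Prodigy's match-pattern format (the entries below are illustrative, not my actual list) can be generated like this:

```python
import json

# Illustrative patterns only: token patterns match per-token attributes
# (here lowercase text), string patterns are matched as exact phrases.
patterns = [
    {"label": "DISEASE", "pattern": [{"lower": "copd"}]},
    {"label": "DISEASE", "pattern": [{"lower": "pulmonary"}, {"lower": "fibrosis"}]},
    {"label": "DISEASE", "pattern": "interstitial lung disease"},
]

with open("diseases_terms.jsonl", "w") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```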


I have not yet tried using shapes for the abbreviations, as @ines suggested.

I then tried to annotate a file of 15k medical abstracts, e.g.:

{"text": "Severe chronic obstructive pulmonary disease (COPD) is a progressive and debilitating illness characterised by relentless loss of function, intensifying dyspnoea and frequent exacerbations. COPD patients are evidently at increased risk of depression, frailty and death [1, 2]. Predicting individual short-term prognosis and course of events is difficult if not impossible.\n\nAdvance care planning should be part of our clinical routine in severe COPD <>"}
{"text": "The management of idiopathic pulmonary fibrosis (IPF) is complex, as is the process of implementing and assessing a set of quality indicators representing best care practices in IPF by an interstitial lung disease (ILD) programme [1, 2]. To date, there is limited literature documenting the importance of IPF interventions to improve coordination of care, patient engagement in health literacy and education, and understanding what is important to patients [3\u20138]. In 2015, National Jewish Health (NJH) engaged our ILD division healthcare professionals (10 physicians, 4 nurses, 2 medical assistants, 1 physician assistant) and our professional education and biostatistics teams to design and implement a project aimed at measuring key quality indicators and how they may impact clinical practice and IPF patient perception of care.\n\nA successful initiative to improve best care practice in IPF supported by electronic medical record changes <>\n\nThe authors are grateful for the support provided by the interstitial lung disease team at National Jewish Health."}

I ran the following command:

prodigy ner.teach diseases_ner pubmed_w2v journal_abstract_training_data.jsonl --label DISEASE --patterns diseases_terms.jsonl

or with the bigger model; both of them raise the buffer exception. (By the way, the ner.teach recipe does not make direct use of the to_bytes() method; it is the EntityRecognizer on line 86, and I do not know how to override that one, as I cannot read the source. Or can I?)

Then I tried the same command with the en, en_core_web_sm and en_core_web_lg models.

This seems to work a little, as my diseases and abbreviations are matched really well. The problem here is that, in the best case, I could only do around 80 examples, and I got as far as 43% on the progress bar. Then Prodigy tells me that there are no more examples. If I restart, I get the same examples (I tried many times; I am starting to recognise the articles Prodigy shows me). But anyway, I tried to move forward and did a batch train:

prodigy ner.batch-train diseases_ner_test3 pubmed_w2v --output diseases --label DISEASE --eval-split 0.2 --n-iter 8 --batch-size 6

As you suggest in the video, I also increased the batch size, as I saw that I could train a little more, but I have very few examples. I get things like:

Loaded model en_core_web_lg
Using 20% of accept/reject examples (7) for evaluation
Using 100% of remaining examples (29) for training
Dropout: 0.2  Batch size: 8  Iterations: 8

BEFORE     0.000
Correct    0
Incorrect  7
Entities   14
Unknown    0

#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         11.629     0          7          13         0          0.000
02         8.879      1          6          18         0          0.143
03         10.857     3          4          20         0          0.429
04         7.192      3          4          18         0          0.429
05         8.207      4          3          17         0          0.571
06         6.249      5          2          26         0          0.714
07         5.691      6          1          21         0          0.857
08         4.769      6          1          27         0          0.857

Correct    6
Incorrect  1
Baseline   0.000
Accuracy   0.857

The accuracy is indeed not bad, and when I give some text to the spaCy NER it does match my diseases, but the NER model is quite broken: words such as “and”, “the” and others are labelled as WORK_OF_ART, etc.

I have noticed that the en_core models use 300-dimensional vectors, while those I downloaded are 200-dimensional. Would that make a difference? Did I do something wrong? Thank you for your help!


I have tried nlp.to_disk(). It worked a little longer, but then the EntityRecognizer was called and it raised the buffer exception.

What would be the best approach to have a working prototype for a new label such as disease?

Any news on that?

Just to let you know: I managed to train new entities with the pubmed w2v. What I did was:

  1. I used en_core_web_lg to train new entities on a list of medical texts.
  2. I did all the steps, exported a model and used it to create gold entities.
  3. I loaded the bin w2v model with gensim and saved it as a text file.
  4. I ran spacy init-model with the saved text file.
  5. I used the gold entities to train the model, basically following the spaCy batch-train example.
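Steps 3 and 4 above, sketched (the init-model flags are from spaCy 2.x; check `python -m spacy init-model --help` for your version):

```python
from gensim.models import KeyedVectors

# Step 3: load the binary vectors and re-save them in text format
w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
w2v.save_word2vec_format('pubmed_w2v.txt', binary=False)

# Step 4 (on the command line): build a fresh spaCy model around the vectors
#   python -m spacy init-model en ./pubmed_model --vectors-loc pubmed_w2v.txt
```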

I did the same with en_core_web_lg, and with both I get very encouraging results. With more training data I am sure I can really improve, as I see some discrepancies between the two differently trained models.

I am thinking to write a blog post about it. Would it interest anyone?


@idealley Thanks for updating, and sorry I missed this thread! I actually suspect you might have encountered a bug in a previous version. Are you currently using v1.5.1?

That sounds like a good workflow. One difficult question is always whether to recommend training on top of an existing NER model (such as en_core_web_lg), or whether to recommend starting from a blank one. The existing model might know useful things, but on the other hand it can also be stubborn about the existing entity definitions, and the training data might not correct them. For instance, I think this is why you had that problem with a rare category like WORK_OF_ART. If you never label examples with that label, the model never sees any negative examples of it, so it’s hard for it to learn not to predict it.
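A sketch of the blank-model alternative, using the spaCy 2.x API (pipeline and label setup only; the training loop itself follows spaCy's standard NER example):

```python
import spacy

# Start from a blank English pipeline: the NER component only ever sees
# the labels you add, so it cannot fall back on categories like
# WORK_OF_ART that your training data never corrects.
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('DISEASE')
optimizer = nlp.begin_training()
```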