Hi,
I am running prodigy train with en_core_web_trf as the base model and hitting a CUDA out-of-memory error. I tried decreasing the batch size in config.cfg and passing it via --config, but the output below says "Using config from base model", so it looks like my config is being ignored and a default one is generated instead?
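For reference, the change I made in prodigy/model/config.cfg was in the batcher block, roughly like this (a sketch from memory, so the exact numbers may differ; the keys are the ones from the stock transformer training config):

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
# lowered from the stock value to try to fit in GPU memory
size = 500
buffer = 256
discard_oversize = true
get_length = null

Here's the command and the full output: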
-> % prodigy train prodigy/model/model_500_fulltext_reviewed2 --ner train_set_full_text500_reviewed2 --base-model en_core_web_trf --eval-split 0.2 --gpu-id 1 --config prodigy/model/config.cfg
2021-12-02 22:18:58.335203: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
ℹ Using GPU: 1
========================= Generating Prodigy config =========================
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'en_core_web_trf' (3.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
ℹ Using config from base model
✔ Generated training config
=========================== Initializing pipeline ===========================
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'en_core_web_trf' (3.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'en_core_web_trf' (3.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'en_core_web_trf' (3.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
[2021-12-02 22:19:11,276] [INFO] Set up nlp object from config
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 403 | Evaluation: 100 (20% split)
Training: 390 | Evaluation: 100
Labels: ner (1)
[2021-12-02 22:19:14,106] [INFO] Pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[2021-12-02 22:19:14,106] [INFO] Resuming training for: ['ner', 'transformer']
[2021-12-02 22:19:14,113] [INFO] Created vocabulary
[2021-12-02 22:19:14,138] [INFO] Finished initializing nlp object
[2021-12-02 22:19:14,138] [INFO] Initialized pipeline components: []
✔ Initialized pipeline
============================= Training pipeline =============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 403 | Evaluation: 100 (20% split)
Training: 390 | Evaluation: 100
Labels: ner (1)
ℹ Pipeline: ['transformer', 'tagger', 'parser', 'attribute_ruler',
'lemmatizer', 'ner']
ℹ Frozen components: ['tagger', 'parser', 'attribute_ruler',
'lemmatizer']
ℹ Initial learn rate: 0.0
E # LOSS TRANS... LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------- -------- ------ ------ ------ ------
⚠ Aborting and saving the final best model. Encountered exception:
RuntimeError('CUDA out of memory. Tried to allocate 578.00 MiB (GPU 1; 14.76 GiB
total capacity; 11.46 GiB already allocated; 57.75 MiB free; 12.01 GiB reserved
in total by PyTorch)')
Traceback (most recent call last):
File "/fn/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/fn/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/prodigy/__main__.py", line 61, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/prodigy/recipes/train.py", line 277, in train
return _train(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/prodigy/recipes/train.py", line 197, in _train
spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/training/loop.py", line 122, in train
raise e
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/training/loop.py", line 105, in train
for batch, info, is_best_checkpoint in training_step_iterator:
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/training/loop.py", line 224, in train_while_improving
score, other_scores = evaluate()
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/training/loop.py", line 281, in evaluate
scores = nlp.evaluate(dev_corpus(nlp))
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/language.py", line 1409, in evaluate
for doc, eg in zip(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py", line 1599, in _pipe
yield from proc.pipe(docs, **kwargs)
File "spacy/pipeline/trainable_pipe.pyx", line 79, in pipe
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy/util.py", line 1618, in raise_error
raise e
File "spacy/pipeline/trainable_pipe.pyx", line 75, in spacy.pipeline.trainable_pipe.TrainablePipe.pipe
File "spacy/pipeline/tagger.pyx", line 141, in spacy.pipeline.tagger.Tagger.predict
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
return self._func(self, X, is_train=False)[0]
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/spacy_transformers/layers/transformer_model.py", line 185, in forward
model_output, bp_tensors = transformer(wordpieces, is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/layers/pytorchwrapper.py", line 134, in forward
Ytorch, torch_backprop = model.shims[0](Xtorch, is_train)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/shims/pytorch.py", line 56, in __call__
return self.predict(inputs), lambda a: ...
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/thinc/shims/pytorch.py", line 66, in predict
outputs = self._model(*inputs.args, **inputs.kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 798, in forward
encoder_outputs = self.encoder(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 498, in forward
layer_outputs = layer_module(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 393, in forward
self_attention_outputs = self.attention(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 321, in forward
self_outputs = self.self(
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yzhong/.virtualenvs/research/lib/python3.8/site-packages/transformers/models/roberta/modeling_roberta.py", line 257, in forward
context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA out of memory. Tried to allocate 578.00 MiB (GPU 1; 14.76 GiB total capacity; 11.46 GiB already allocated; 57.75 MiB free; 12.01 GiB reserved in total by PyTorch)
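From the traceback, the OOM is actually raised during evaluation (nlp.evaluate via spacy/training/loop.py), not during the update step, so I'm guessing the training batcher isn't the only setting that matters. If I read the en_core_web_trf config layout correctly, these are the other knobs that affect GPU memory there (my assumption; the values below are just illustrative):

[nlp]
# default batch size for nlp.pipe / nlp.evaluate; my assumption is that
# this is what controls memory during the evaluation step that crashed
batch_size = 32

[components.transformer]
# caps the number of padded wordpiece items per transformer batch
max_batch_items = 1024

Is lowering these the right approach, and what do I need to do so that prodigy train actually uses my --config instead of regenerating one from the base model?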