prodigy train OutOfMemoryError

Hello, I would just like a confirmation.

When training my spancat model in GPU mode, it crashes and I get the error "cupy.cuda.memory.OutOfMemoryError".

I've been browsing the forums, and the problem often seems to come from the batch size, so I set batch_size to 1. After that, I still get the error.

Is it because my training data contains long conversations (sometimes more than 20 minutes), so I'd have to split each conversation, or does it have nothing to do with that at all?

This is my command:

python3 -m prodigy train ./training_gpu --spancat my_train_datas --eval-split 0.10 --lang "fr" --gpu-id 0 --label-stats --config ./training/conf.cfg

Here is my conf file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "fr"
pipeline = ["tok2vec","spancat"]
batch_size = 1
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 65

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.03
sample_size = 1.0
ner = null
textcat = null
textcat_multilabel = null
parser = null
tagger = null
senter = null

[corpora.spancat]
@readers = "prodigy.SpanCatCorpus.v1"
datasets = ["lettria_annotation_initiale"]
eval_datasets = []
spans_key = "sc"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 100
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

hi @dad766!

Yes, very likely that's the problem. Have you tried training on CPU? Also, can you try training on a random sample of your shortest docs to see if it still runs? That will at least confirm whether doc length is the issue.
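
For example, here's a rough sketch of how you could pull out only the shortest examples. It assumes you've exported your dataset with prodigy db-out (e.g. prodigy db-out your_dataset > annotations.jsonl; the dataset name, file names and the 200-example sample size are just placeholders):

import json

# Sketch: keep only the N shortest annotated examples from a db-out export,
# to check whether doc length is what triggers the OOM error.
N = 200  # arbitrary sample size

with open("annotations.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Sort by word count (a rough proxy for token count)
examples.sort(key=lambda eg: len(eg["text"].split()))

with open("annotations_short.jsonl", "w", encoding="utf-8") as f:
    for eg in examples[:N]:
        f.write(json.dumps(eg, ensure_ascii=False) + "\n")

You could then load annotations_short.jsonl into a fresh dataset with prodigy db-in and train on that to see whether the OOM goes away.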

One possible option is to try a different suggester function than the n-gram one, since on long documents the n-gram suggester blows up the number of candidate spans.
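
For example, with max_size = 65 your ngram_range_suggester proposes every span of 1 to 65 tokens per position, which grows very quickly on long texts. A minimal sketch of a change, assuming your annotated spans are actually much shorter than 65 tokens (worth checking in your data first), would be to cap the range in your existing config:

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
# assumes your real spans are rarely longer than ~10 tokens
max_size = 10

Alternatively, spacy.ngram_suggester.v1 takes an explicit sizes list (e.g. sizes = [1,2,3,5,10]) if you only want a fixed set of span lengths.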

Is there any way you can break up your transcripts? I know transcripts sometimes lack sentences/punctuation, but even some simple rules could work. I think a clever way to segment your data may help your model more than a different suggester, though.
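
For instance, here's a minimal sketch of pre-splitting each transcript into fixed-size word blocks before annotating/training (the 200-word block size, the file names and the "text" field are assumptions; adapt them to your JSONL source):

import json

BLOCK_SIZE = 200  # words per block; tune to what your GPU can handle

def split_into_blocks(text, size=BLOCK_SIZE):
    # Naive whitespace-based splitting; replace with smarter rules
    # (speaker turns, pauses, etc.) if your transcripts have them.
    words = text.split()
    for i in range(0, len(words), size):
        yield " ".join(words[i:i + size])

with open("transcripts.jsonl", encoding="utf-8") as src, \
     open("transcripts_blocks.jsonl", "w", encoding="utf-8") as out:
    for line in src:
        record = json.loads(line)
        for block in split_into_blocks(record["text"]):
            out.write(json.dumps({"text": block}, ensure_ascii=False) + "\n")

Note this only works on raw source text; if you split examples that are already annotated, you'd also have to remap the span offsets.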

Since you have more questions on training, you may also want to check out spaCy's discussion forum; there are more posts there on optimizing training with spaCy.

Hope this helps!

hi @ryanwesslen,
I cut the conversations into 200-word blocks and I no longer ran into this problem.
However, I had another problem with memory allocation:

CUDA_ERROR_ILLEGAL_ADDRESS

which I was able to fix by adjusting the batch_size parameter, so that part is OK now.
But the results are bad according to the label stats.

">

What could explain these results? How is it possible that the precision is always 0?

One important detail: my samples are in French and UTF-8 encoded. I wonder if that could be the problem.

Otherwise, the command I run is:

python3 -m prodigy train ./training_produit --spancat test_produit --eval-split 0.10 --lang "fr" --gpu-id 0

thanks for your help

hi @dad766!

That's tough to say. I'd first try running your model with spans.correct on your existing data to review some examples.

Instead of passing a file as a source, you can pass the name of a dataset along with your trained model:

python3 -m prodigy spans.correct dataset:test_produit my_new_model ...

See if you can run this to get an idea. You may find you need to correct some labels.

Also, how many labels are you using? I suspect it could be that you simply have too few examples.
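
For example, here's a quick sketch to count annotated spans per label from a db-out export (assuming the default Prodigy span format and a test_produit.jsonl export; adjust names as needed):

import json
from collections import Counter

counts = Counter()
with open("test_produit.jsonl", encoding="utf-8") as f:
    for line in f:
        eg = json.loads(line)
        # only count examples accepted during annotation
        if eg.get("answer") == "accept":
            for span in eg.get("spans", []):
                counts[span["label"]] += 1

for label, n in counts.most_common():
    print(label, n)

If some labels only have a handful of examples, a precision of 0 for them isn't surprising.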

If you still can't figure it out, I think the spaCy discussion forum may have more insight on GPU tuning. For example, this post discusses balancing GPU memory and performance.