Memory Error

Hi again,

We’ve recently spun up a new VM to train a spaCy model on a new set of annotations collected with ner.teach and ner.manual (written to the same dataset). We’re hitting some sort of memory error when the model is stored as a bytes object. I’m wondering if you have any insight into whether this is a config issue within spaCy or something in the environment.

The output from training is below. We’ve tried a couple of times with varying parameters (n=10 in one case). The model gets through all of the iterations (so the computation itself completes without issue) but errors out when writing the model to the specified output directory. Any ideas?

newvm@Spacy:~/ner/annotations/0216$ python3 -m prodigy ner.batch-train ner_set_0216 en_core_web_lg -n 20 --output /home/bdsdev/ner/models/model_3
Using 100% of remaining examples (1082) for training
Dropout: 0.2 Batch size: 32 Iterations: 20

BEFORE 0.332
Correct 476
Incorrect 957
Entities 1578
Unknown 654

LOSS RIGHT WRONG ENTS SKIP ACCURACY

01 16.588 572 525 1073 0 0.521
02 14.218 677 418 1033 0 0.618
03 13.778 698 376 1027 0 0.650
04 13.385 737 333 1062 0 0.689
05 12.826 752 316 1087 0 0.704
06 12.252 757 300 1076 0 0.716
07 11.525 756 304 1101 0 0.713
08 11.006 758 297 1152 0 0.718
09 11.047 768 285 1139 0 0.729
10 10.116 769 280 1181 0 0.733
11 9.520 777 274 1201 0 0.739
12 9.517 768 282 1263 0 0.731
13 9.180 768 273 1323 0 0.738
14 8.783 770 281 1361 0 0.733
15 8.852 755 290 1376 0 0.722
16 8.650 763 283 1298 0 0.729
17 8.373 773 276 1279 0 0.737
18 8.514 768 273 1231 0 0.738
19 8.123 763 271 1306 0 0.738
20 7.912 771 265 1302 0 0.744

Correct 771
Incorrect 265
Baseline 0.332
Accuracy 0.744
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bdsdev/.local/lib/python3.5/site-packages/prodigy/__main__.py", line 248, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/bdsdev/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/bdsdev/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/bdsdev/.local/lib/python3.5/site-packages/prodigy/recipes/ner.py", line 376, in batch_train
    model.from_bytes(best_model)
  File "cython_src/prodigy/models/ner.pyx", line 393, in prodigy.models.ner.EntityRecognizer.from_bytes
  File "/home/bdsdev/.local/lib/python3.5/site-packages/spacy/language.py", line 679, in from_bytes
    msg = util.from_bytes(bytes_data, deserializers, {})
  File "/home/bdsdev/.local/lib/python3.5/site-packages/spacy/util.py", line 503, in from_bytes
    setter(msg[key])
  File "/home/bdsdev/.local/lib/python3.5/site-packages/spacy/language.py", line 669, in <lambda>
    ('vocab', lambda b: self.vocab.from_bytes(b)),
  File "vocab.pyx", line 423, in spacy.vocab.Vocab.from_bytes
  File "/home/bdsdev/.local/lib/python3.5/site-packages/spacy/util.py", line 503, in from_bytes
    setter(msg[key])
  File "vocab.pyx", line 421, in spacy.vocab.Vocab.from_bytes.lambda4
  File "vocab.pyx", line 417, in spacy.vocab.Vocab.from_bytes.serialize_vectors
  File "vectors.pyx", line 408, in spacy.vectors.Vectors.from_bytes
  File "/home/bdsdev/.local/lib/python3.5/site-packages/spacy/util.py", line 503, in from_bytes
    setter(msg[key])
  File "vectors.pyx", line 402, in spacy.vectors.Vectors.from_bytes.deserialize_weights
  File "/home/bdsdev/.local/lib/python3.5/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "/home/bdsdev/.local/lib/python3.5/site-packages/msgpack/fallback.py", line 124, in unpackb
    ret = unpacker._unpack()
  File "/home/bdsdev/.local/lib/python3.5/site-packages/msgpack/fallback.py", line 600, in _unpack
    ret[key] = self._unpack(EX_CONSTRUCT)
  File "/home/bdsdev/.local/lib/python3.5/site-packages/msgpack/fallback.py", line 617, in _unpack
    return bytes(obj)
MemoryError

I suspect the VM might be out of RAM? nlp.to_bytes() with the lg model produces a pretty big bytestring, because of all the word vectors. If msgpack makes a copy of the string to read it in, and the model already has the vectors loaded before it starts deserializing, we end up with multiple copies of this data, which eats up memory fast.
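A quick back-of-the-envelope check makes the suspicion plausible. The figures below are assumptions, not measurements from your VM: en_core_web_lg ships on the order of 685k vectors with 300 float32 dimensions (exact counts vary by version):

```python
# Rough estimate of how much RAM duplicate copies of the vector table cost.
# Assumed figures for en_core_web_lg; not measured on the failing VM.
n_vectors = 685_000   # approximate number of vectors in the lg model
dims = 300            # vector width
bytes_per_float = 4   # float32

one_copy_gb = n_vectors * dims * bytes_per_float / 1024**3
print(f"one copy of the vector table: ~{one_copy_gb:.2f} GiB")

# Loaded model + serialized bytestring + msgpack's intermediate copy:
for copies in (1, 2, 3):
    print(f"{copies} copies: ~{copies * one_copy_gb:.2f} GiB")
```

Three simultaneous copies would already be well over 2 GiB, which is enough to sink a small VM that also holds the training data and Python itself.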

Maybe try changing the call to model.to_bytes() to model.nlp.to_bytes(disable=['vocab'])? Then change model.from_bytes() to model.nlp.from_bytes().
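If you want to experiment with that, the edit to batch_train in prodigy/recipes/ner.py would look roughly like the sketch below. Treat it as a sketch only: the exact attribute names depend on your Prodigy version.

```python
# Hypothetical edit inside batch_train (prodigy/recipes/ner.py), not a tested
# patch. Snapshot the best model without the vocab, so the vector table is
# never serialized or copied by msgpack:
best_model = model.nlp.to_bytes(disable=['vocab'])

# ...and later, when restoring the best model:
model.nlp.from_bytes(best_model)
```

Since the vocab (and its vectors) never changes during training, skipping it in the snapshot loses nothing: the loaded model keeps its original vectors.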

You can also just make the VM bigger, of course – but obviously it’s nice to keep costs lower.

What about using a swap file? (See https://wiki.archlinux.org/index.php/swap.) A 5 GB swap fixed some MemoryErrors during setup on a small demo VM. I know it will be slower, but maybe it’s a simple way to save money in the beginning?
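For reference, setting one up is only a few commands. The path and size here are examples; all of this needs root:

```shell
# Create and enable a 5 GiB swap file (example path; run as root)
fallocate -l 5G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Optionally make it persistent across reboots
echo '/swapfile none swap defaults 0 0' >> /etc/fstab
```

Note that if the vectors get paged out to swap during deserialization, training and model loading can slow down dramatically, so this is a stopgap rather than a fix.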



With 2,500 expressions on a French model, we also see an increase in RAM usage.