Error on saving model from textcat.batch-train

I'm getting an error after running textcat.batch-train that I think may be related to the size of the model I'm starting from. I'm starting from a large (3.3 GB) custom model that has aligned word vectors from many languages. Memory usage during training is high (11 GB+), and at the end of training I get this error when it tries to save:

...
Baseline   0.63
Precision  0.89
Recall     0.91
F-score    0.90
Accuracy   0.92
Traceback (most recent call last):
  File "/Users/ahalterman/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/ahalterman/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/__main__.py", line 242, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/recipes/textcat.py", line 154, in batch_train
    nlp = nlp.from_bytes(best_model)
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/language.py", line 671, in from_bytes
    msg = util.from_bytes(bytes_data, deserializers, {})
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/spacy/util.py", line 500, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "/Users/ahalterman/anaconda3/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "msgpack/_unpacker.pyx", line 139, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2068)
ValueError: 3378393043 exceeds max_bin_len(2147483647)

@andy Thanks, I didn’t know msgpack had that limit.
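For reference, the two numbers in the error line up with a signed 32-bit size limit. A quick check in plain Python, using only the values copied from the traceback (no msgpack calls involved):

```python
# Values copied from the ValueError in the traceback above.
payload_bytes = 3378393043
max_bin_len = 2147483647

# The limit is the largest signed 32-bit integer.
assert max_bin_len == 2**31 - 1

# How far the serialized model is over the limit, and its size in GiB.
excess_bytes = payload_bytes - max_bin_len
payload_gib = payload_bytes / 2**30

print(f"payload: {payload_gib:.2f} GiB, over the limit by {excess_bytes} bytes")
```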

As a mitigation, you could try passing vocab=False to model.to_disk(). You’ll then have to manage loading the vocab separately. If you want to keep loading from a single directory, you could copy the vocab in from the source model after saving. Alternatively, you could load the vocab from a single place each time, keeping the model and the vocab/vectors separate.
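A minimal sketch of the "copy it in from the source" step, using only the standard library. It assumes the vocab lives in a subdirectory named vocab inside the source model directory; the helper name copy_vocab and the vocab_dirname parameter are made up for illustration, so check the layout of your own model before relying on it.

```python
import shutil
from pathlib import Path


def copy_vocab(source_model_dir, saved_model_dir, vocab_dirname="vocab"):
    """Copy the vocab directory from the source model into a saved model.

    Assumption: the vocab is stored in a subdirectory called
    `vocab_dirname` inside the source model directory.
    """
    source = Path(source_model_dir) / vocab_dirname
    target = Path(saved_model_dir) / vocab_dirname
    if not source.is_dir():
        raise FileNotFoundError(f"no vocab directory at {source}")
    if target.exists():
        # Replace a stale or partial copy rather than merging into it.
        shutil.rmtree(str(target))
    shutil.copytree(str(source), str(target))
    return target
```

You would call it once after training finishes, e.g. copy_vocab("/path/to/source_model", "/path/to/saved_model") with your own paths, so the saved directory can still be loaded from a single place.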

If you like, you could make a subclass of English (or whatever other Language) that handles the to/from bytes/disk methods differently in this way, for convenience.
