Command "ner.batch-train" returns MemoryError

When I run:
prodigy ner.batch-train my_dataset en_core_web_sm --output ./ --n-iter 25 --eval-split 0.2 --dropout 0.2
It returns:

Traceback (most recent call last):
  File "/usr/lib/python3.5/", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/", line 85, in _run_code
    exec(code, run_globals)
  File "/home/rohan/.local/lib/python3.5/site-packages/prodigy/", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.5/dist-packages/", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.5/dist-packages/", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/rohan/.local/lib/python3.5/site-packages/prodigy/recipes/", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 39, in split_sentences
  File "/home/rohan/.local/lib/python3.5/site-packages/spacy/", line 708, in pipe
    for doc, context in izip(docs, contexts):
  File "/home/rohan/.local/lib/python3.5/site-packages/spacy/", line 736, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 221, in pipe
  File "/home/rohan/.local/lib/python3.5/site-packages/spacy/", line 460, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "nn_parser.pyx", line 226, in pipe
  File "nn_parser.pyx", line 244, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 257, in spacy.syntax.nn_parser.Parser.greedy_parse
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 169, in __call__
    return self.predict(x)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 133, in predict
    y, _ = self.begin_update(X, drop=None)
  File "_parser_model.pyx", line 214, in spacy.syntax._parser_model.ParserModel.begin_update
  File "_parser_model.pyx", line 262, in spacy.syntax._parser_model.ParserStepModel.__init__
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/", line 295, in begin_update
    X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 25, in begin_update
    y, bp_y = self._layers[0].begin_update(X, drop=drop)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/home/rohan/.local/lib/python3.5/site-packages/thinc/neural/_classes/", line 33, in begin_update
    X__bo = self.ops.seq2col(X__bi, self.nW)
  File "ops.pyx", line 557, in thinc.neural.ops.NumpyOps.seq2col
  File "ops.pyx", line 401, in thinc.neural.ops.NumpyOps.allocate

I was reading This Topic and This Other Topic, but neither helped. I checked numpy, and everything seems fine. I'm still not sure why I'm not able to train it.

How much memory do you have?

By 'memory', I assume you mean the RAM size. I'm using a Google Cloud virtual machine with 8GB of it. I know that's not much, but I have used the same recipe earlier and it worked like a charm. For now, I just want to batch train for a single label. I'm not sure if something's wrong with numpy or my setup after the update. It could be the memory, of course. But what alternative solution can we have here, since I'm unable to upgrade the memory?

(BTW, thanks for a prompt reply.)

Yeah, exactly, I meant the RAM. 8GB should be fine, although 16GB is usually better. Did you change anything between training successfully and getting the memory error?

Also, since you also opened that other thread, Command "db-in" returns "MySQL server has gone away": did you try to import those ~10MB of examples into the same dataset? If even one of those imports succeeded, this could easily explain the memory error.

Yeah! That is the same dataset I'm trying to train on. None of them works. It seems like it's all happening because of the size of the documents. They're basically PDFs, each containing 50+ pages. I'm planning to annotate them page-wise. But I'm afraid I might lose important details in the documents due to the separation. Will give it a shot. Thanks!

Yes, that explains a lot. And you can always separate them by paragraphs and add an "id" or something similar to the data with a reference to the original document. This way, you always know what belongs together.
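A minimal sketch of this idea, assuming the paragraphs are separated by blank lines and using a hypothetical `paragraph_tasks` helper (the `"meta"` field is where Prodigy tasks conventionally carry extra context like a source-document reference):

```python
import json

def paragraph_tasks(doc_id, text):
    """Split a document into paragraph-sized tasks, keeping a
    reference back to the source document in each task's "meta"."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs):
        yield {"text": para, "meta": {"doc_id": doc_id, "paragraph": i}}

# Example: two paragraphs from one source document
text = "First paragraph of the report.\n\nSecond paragraph of the report."
tasks = list(paragraph_tasks("report_001", text))
for task in tasks:
    print(json.dumps(task))  # one JSON object per line, ready for db-in
```

Because every task records its `doc_id` and paragraph index, you can always reassemble or cross-reference the original document later.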

The thing is, if your goal is to train an NER model or text classifier with spaCy, there's really nothing you gain from annotating and training on these huge documents. The model is sensitive to the very local context – so if you can't make an annotation decision based on the local context, the model is very unlikely to learn anything meaningful anyway.
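One way to pre-split long texts into sentence-sized units yourself is spaCy's rule-based sentencizer; this sketch assumes a recent spaCy (v3-style API), where a blank pipeline is enough and no trained model needs to be downloaded:

```python
import spacy

# Blank English pipeline with a rule-based sentence splitter;
# good enough for chunking text before annotation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Models look at local context. Long documents add little.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Each sentence then becomes its own annotation task, which keeps memory use per example small and matches the local context the model actually learns from.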