Intermittent segmentation fault with prodigy ner.batch-train ner_product en_core_web_sm --output vuln_model --n-iter 10 --eval-split 0.2 --dropout 0.2
when ner_product has about 5k examples taken from malware vulnerability reports
Yes. Could be memory issues. I see two Python 3.5 processes, one using about 9GB and one about 5GB, and the machine has 16GB of RAM (macOS Sierra 10.12.6).
That’s rather more memory than we should be using, but it’s sort of unsurprising.
Could you try setting your batch size lower? There are a number of places where we accumulate a batch out of the stream, so we can end up with some multiple of the batch size in memory.
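For example, if your version of the ner.batch-train recipe exposes a --batch-size argument (check prodigy ner.batch-train --help for the exact flags), the run above with a smaller batch might look like this:

    prodigy ner.batch-train ner_product en_core_web_sm --output vuln_model \
        --n-iter 10 --eval-split 0.2 --dropout 0.2 --batch-size 16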
Batch size was 32, and running for 40 iterations finished with a memory footprint of 29GB.
So it looks like some kind of systematic memory leak to me.
But it did complete without segfaulting this time.
Found a memory leak in the beam parsing code that looks relevant. The ParserBeam has a destructor that’s supposed to clean up the state objects, but I think there’s a reference cycle, so it’s never collected. Additionally, if it is collected, it looks like it would free the state objects too soon, which would explain the occasional segfaults.
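Just to illustrate the failure mode in plain Python (the names below are made up, not the actual spaCy/Cython code): if an object whose destructor frees resources sits in a reference cycle, reference counting never triggers the destructor, and the cleanup only happens if and when the cyclic garbage collector runs. And if that destructor then frees objects that something else still points at, you get the kind of use-after-free that shows up as a segfault.

    import gc

    class State:
        def __del__(self):
            print("state freed")

    class BeamLike:
        # made-up stand-in for an object that owns parser states
        def __init__(self):
            self.states = [State()]
            self.owner = self  # accidental reference cycle back to ourselves

        def __del__(self):
            # destructor that cleans up the owned states
            self.states.clear()

    b = BeamLike()
    del b         # refcount never reaches zero because of the cycle,
                  # so the destructor doesn't run here and the states leak
    gc.collect()  # cleanup only happens when the cyclic GC gets around to it

The actual fix is in the Cython beam code, of course; the sketch is only meant to show why a cycle can delay (or mistime) the destructor.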
It could be that I’ve fixed a different error, and the bug you’ve hit still persists. But I’ll mark this resolved, because I do think the one I’ve fixed is the relevant one.
I am also getting this error during ner.batch-train (using Prodigy version 1.5.1). The bigger my annotation dataset, the more often the error seems to occur.
I’m suffering from this as well. The bigger my dataset gets, the more frequently the segmentation fault occurs. At this point, with 11,000 examples in my NER dataset and 16GB of memory, I cannot train anything!
Fatal Python error: GC object already tracked
Thread 0x00007f4232a22700 (most recent call first):
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/threading.py", line 299 in wait
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/threading.py", line 551 in wait
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/site-packages/tqdm/_monitor.py", line 68 in run
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/threading.py", line 884 in _bootstrap
Current thread 0x00007f429e6c9080 (most recent call first):
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 442 in batch_train
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/site-packages/plac_core.py", line 207 in consume
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/site-packages/plac_core.py", line 328 in call
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/site-packages/prodigy/__main__.py", line 259 in <module>
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/runpy.py", line 85 in _run_code
File "/home/aniruddha/.pyenv/versions/3.6.5/lib/python3.6/runpy.py", line 193 in _run_module_as_main
[1] 17112 abort (core dumped) python -m prodigy ner.batch-train .....
@aniruddha Try setting a lower batch size, and perhaps also a lower beam-width. We’re working on addressing these memory-usage issues within spaCy. Thanks for your patience – I know it’s not very satisfying!
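For instance, assuming your Prodigy version exposes --batch-size and --beam-width on ner.batch-train (again, prodigy ner.batch-train --help will show the exact options), something like:

    python -m prodigy ner.batch-train your_dataset en_core_web_sm \
        --output /tmp/ner_model --n-iter 10 --batch-size 16 --beam-width 4

should keep much less of the stream in memory at any one time (your_dataset and /tmp/ner_model are placeholders here).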