I am trying to train a span categorizer and it runs out of memory before any actual training starts. I am using the latest version of spaCy (3.2.3) and Prodigy 1.11.7. Additionally, I have installed GPU support (cu113).
My system has 64GB of RAM with 32GB of swap. The GPU has 24GB (RTX 3090).
I have annotated a set of 240 span cats, and I run the training using this command:
prodigy train ./prod_span_models --spancat test_dataset --base-model en_core_web_lg --eval-split 0.20 --label-stats --verbose
I end up with the following error:
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/spacy/ml/extract_spans.py", line 32, in forward
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index]
numpy.core._exceptions.MemoryError: Unable to allocate 221. GiB for an array with shape (617037108, 96) and data type float32
It doesn't matter whether I use the GPU or not.
Some of my span categories can get large: I get a list of names followed by a fixed word or phrase such as "Defendants" or "Plaintiffs". Usually it is a fairly small list, but it can grow to 10-20 names (20-60 tokens).
Is that the reason for my memory error? Is there any way to debug this to find out whether something specific is causing it, or a way to work around it?
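For reference, the size in the error message follows directly from the array shape it reports. A quick check, using only the numbers from the traceback (float32 takes 4 bytes per value):

```python
# Shape and dtype are copied from the MemoryError message above.
rows, width = 617_037_108, 96
bytes_per_float32 = 4

total_bytes = rows * width * bytes_per_float32
total_gib = total_bytes / 2**30  # 1 GiB = 2**30 bytes

# NumPy rounds this figure to three significant digits in its message.
print(f"{total_gib:.1f} GiB")
```

So the allocation is large because of the number of rows (over 600 million), not because of the vector width.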
It's possible that this is related to the suggester function, which by default uses an ngram range covering all the span lengths available in the data. So if you have really long spans, you'll end up with a lot of potential candidates (e.g. all possible spans between 1 and 60 tokens, which can be a lot). If you run prodigy train with the --verbose flag, it should show you more detailed information on the suggester function used: https://prodi.gy/docs/span-categorization#suggesters
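To get a feel for how quickly the candidates grow, here is a rough sketch in plain Python. It assumes an ngram suggester that proposes every contiguous span from min_size to max_size tokens, and that the array in the traceback holds one row per token of every candidate span. The document length of 1000 tokens is a made-up value for illustration; the width of 96 comes from the error message.

```python
from typing import Tuple


def candidate_rows(doc_len: int, min_size: int, max_size: int) -> Tuple[int, int]:
    """Return (number of candidate spans, total tokens across those spans)."""
    spans = 0
    tokens = 0
    for size in range(min_size, max_size + 1):
        # A document of doc_len tokens has doc_len - size + 1 ngrams of this size.
        n = max(doc_len - size + 1, 0)
        spans += n
        tokens += n * size
    return spans, tokens


def estimate_gib(total_tokens: int, width: int = 96, bytes_per_value: int = 4) -> float:
    """Estimate memory for a float32 array with one row per span token."""
    return total_tokens * width * bytes_per_value / 2**30


# Hypothetical 1000-token document, comparing two maximum span sizes.
for max_size in (10, 60):
    spans, tokens = candidate_rows(doc_len=1000, min_size=1, max_size=max_size)
    print(max_size, spans, tokens, f"{estimate_gib(tokens):.3f} GiB")
```

The token total grows roughly with the square of the maximum span size, and it is summed over every document in a batch, so a few very long annotated spans can dominate the memory use.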
One option to prevent this would be to use a config that defines a different logic for potential span candidates via the suggester function: https://spacy.io/api/spancategorizer#suggesters How you set this up depends on the data, but there might be common patterns that you can use instead of considering every possible combination.
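As a minimal sketch, the built-in ngram range suggester can be capped in the training config. The max_size value below is only an illustrative assumption and needs to be chosen for your data; spans longer than max_size will never be suggested, so the model cannot predict them.

```ini
[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 20
```

This assumes you train from a custom config, for example by passing it to prodigy train with the --config option.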
The suggester functions can also integrate with Prodigy during annotation so you can ensure that only spans matching the suggester can be selected: https://prodi.gy/docs/span-categorization#suggesters-annotation
I'm still new and learning the fundamentals, but as a temporary solution you can throw capacity at it: get a spot Azure VM and test until your system is optimized.
I have access to a spot instance (US-East 2) of an ND96amsr_A100_v4 for ~$13-15/hr, which includes:
- 8 A100 GPUs
- 1924 GB Memory
- 2900 GB Temp storage
Economies of scale
@ines I recently encountered a similar issue and setting the max size for the suggester to 60 (from 90) worked! 10.1/12.0 GB GPU memory in use for 50 different job postings with varying lengths.
@mwade-noetic I also set my gpu_allocator to "pytorch" so try that as well if you have not already.
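For anyone looking for where that setting lives, it would typically go in the [system] block of the training config. This is a sketch that assumes PyTorch is installed and that your config keeps the default interpolation from [system] into [training]:

```ini
[system]
gpu_allocator = "pytorch"
```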