I am trying to train a spancat model and it runs out of memory before any actual training starts. I am using the latest versions of spaCy (3.2.3) and Prodigy (1.11.7), and I have installed GPU support (cu113). I have annotated a set of 240 spancat examples, and when I run the training using this command:
My system has 64GB of RAM with 32GB of swap. GPU has 24GB (RTX 3090).
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/thinc/model.py", line 291, in call
return self._func(self, X, is_train=is_train)
File "/home/mwade/PycharmProjects/DRD_MultiCat_Model/venv/lib/python3.9/site-packages/spacy/ml/extract_spans.py", line 32, in forward
Y = Ragged(X.dataXd[indices], spans.dataXd[:, 1] - spans.dataXd[:, 0]) # type: ignore[arg-type, index]
numpy.core._exceptions.MemoryError: Unable to allocate 221. GiB for an array with shape (617037108, 96) and data type float32
Doesn't matter whether I use GPU or not.
Some of my spans can get large: I get a list of names followed by a fixed word/phrase such as "Defendants" or "Plaintiffs". Usually it is a fairly small list, but it can grow to 10-20 names (20-60 tokens).
Is that the reason for my memory error? Is there any way to debug this to find out if something specific is causing it, or to work around it?
It's possible that this is related to the suggester function, which by default will use an ngram range covering all the span lengths available in the data. So if you have really long spans, you'll end up with a lot of potential candidates (e.g. all possible spans between 1 and 60 tokens, which can be a lot). If you run prodigy train with the --verbose flag, it should show you more detailed information on the suggester function used: Span Categorization · Prodigy · An annotation tool for AI, Machine Learning & NLP
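For example, something along these lines (assuming your annotations are in a Prodigy dataset called your_spans_dataset and you want the pipeline saved to ./output; adjust both to your setup):

prodigy train ./output --spancat your_spans_dataset --verbose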
One option to prevent this would be to use a config that defines different logic for the potential span candidates via the suggester function: SpanCategorizer · spaCy API Documentation. How you set this up depends on the data, but there might be common patterns you can use instead of considering every possible combination.
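For instance, if your spans rarely exceed ~10 tokens, you could cap the candidate length by overriding the suggester block in your training config. This is just a sketch using the built-in spacy.ngram_range_suggester.v1; the right max_size depends on your data:

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 10

With this, only spans of 1-10 tokens are generated as candidates, instead of every length seen in the data.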
I'm still new and learning the fundamentals, but as a temporary resource solution you can throw capacity at it: get a spot Azure VM and test until your system is optimized.
I have a spot instance (US-East 2) where I can use an ND96amsr_A100_v4 for ~$13-15/hr, which includes:
@ines I recently encountered a similar issue, and setting the max size for the suggester to 60 (down from 90) worked! 10.1/12.0 GB of GPU memory in use for 50 different job postings of varying lengths.
@mwade-noetic I also set my gpu_allocator to "pytorch", so try that as well if you haven't already.
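In the training config, that's the gpu_allocator setting under the [system] block, e.g.:

[system]
gpu_allocator = "pytorch"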