I have a NER workflow where Prodigy's memory usage goes through the roof at startup time.
My model has roughly 20 labels; it was bootstrapped with
blank(language) and trained with the spaCy command line. It weighs 9 to 13 MB (depending on the language and the training corpus, I guess).
But whatever the size of the model, when I launch Prodigy with
ner.correct (if that matters) and a JSONL input file, memory usage skyrockets at startup.
My input file is 2.5 MB for 1250 samples, so each one is fairly big, but not that big. Each task only has a text, an input hash, a task hash, and a meta with one attribute.
I enabled verbose logging in Prodigy: memory usage starts climbing after a bunch of FILTER messages (indeed, I had 5-10 invalid tasks, actually with an empty text) and before the CORS log message. Removing those invalid tasks does not solve the issue.
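For what it's worth, here is roughly how I stripped those empty-text tasks before feeding the file to Prodigy (file names are placeholders, not my actual paths):

```python
import json

def filter_tasks(in_path, out_path):
    """Copy a JSONL task file, dropping tasks whose "text" is empty."""
    kept = dropped = 0
    with open(in_path, encoding="utf8") as fin, \
         open(out_path, "w", encoding="utf8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines entirely
            task = json.loads(line)
            # Keep only tasks with a non-empty, non-whitespace text
            if task.get("text", "").strip():
                fout.write(json.dumps(task) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Even with the cleaned file as input, the startup peak is the same.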
By skyrocketing I mean going north of 2 GB. In a constrained environment where I have only about 1 GB free, the out-of-memory killer consistently kills Prodigy before the server starts.
After the peak, once the server has started, memory usage goes back down to a more reasonable level (under 250 MB).
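In case the numbers matter, this is essentially how I'm reading the peak (a Linux-specific sketch; note that ru_maxrss is in kilobytes on Linux but in bytes on macOS):

```python
import resource

def peak_rss_mb():
    # Peak resident set size of the current process so far.
    # On Linux ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
```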
On a hunch, I disabled validation in my
prodigy.json file, and that makes the memory peak disappear. But it kind of reaches the same level once I load the first sample - maybe a bit less, but it still climbs.
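Concretely, that's this setting in my prodigy.json (as far as I understand the config, it defaults to true):

```json
{
  "validate": false
}
```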
My feeling is that validation buffers too many things in memory for some reason.
Is this normal? Am I missing something?
This happens too deep for me to debug: I added a
print() statement inside the ner.py file of the Prodigy package, just to see whether the generator was being consumed ahead of time to loop through all samples, and it is not (I only see one print after the server has started). So whatever happens is probably inside the native / loader code, I guess.
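The way I checked for eager consumption was essentially wrapping the stream in a counting generator, roughly like this (names are mine, not Prodigy's):

```python
def counted(stream, label="stream"):
    # Yield items unchanged, logging how many have been pulled so far,
    # so that eager consumption before server startup would be visible.
    n = 0
    for task in stream:
        n += 1
        print(f"[{label}] task #{n} pulled")
        yield task
```

Since only one such line is printed before the server comes up, the stream itself stays lazy.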
This happens with Prodigy versions 1.10.2 and 1.10.4.