ner.correct memory usage

Hi,

I have a NER workflow where prodigy memory usage goes off the chart at startup time.

My model has approx. 20 different labels. It is bootstrapped with blank(language) and trained using the spaCy command line. It weighs 9 to 13 MB (depending on the language and the training corpus, I guess).

But whatever the size of the model, when I launch Prodigy with ner.correct (if that matters) on a JSONL input file, the startup memory usage skyrockets.

My input file is 2.5 MB for 1250 samples, so each one is pretty big, but not that big. Each task only has a text, an input hash and a task hash, and a meta with one attribute.
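For reference, each task looks roughly like this (written as a Python dict; the hash and meta values below are made up):

# Rough shape of one task from my JSONL file (values are made up)
task = {
    "text": "A fairly long paragraph of raw text ...",
    "_input_hash": -123456789,
    "_task_hash": 987654321,
    "meta": {"source": "batch-01"},  # single attribute; the name is just an example
}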

I enabled verbose logging in Prodigy: the memory usage starts climbing after a bunch of FILTER messages (indeed, I had 5-10 invalid tasks with an empty text) and before the CORS logging message. Removing those invalid tasks does not solve the issue.

By skyrocketing I mean going north of 2 GB. In a constrained environment where I only have approx. 1 GB free, the out-of-memory killer consistently kills Prodigy before the server starts.

After the peak, and once the server has started, memory usage goes down to a more reasonable level (less than 250 MB).

On a hunch, I disabled validation in my prodigy.json file, and this makes the startup memory peak disappear. But it kind of reaches the same level once I load the first sample - maybe a bit less, but it still climbs.

I feel like the validation somehow buffers too many things in memory for some reason.
Is this normal, or am I missing something?

This happens too deep for me to debug: I added a print() statement inside the ner.py file in the prodigy package, just to see if the generator was consumed ahead of time to loop through all samples, and it is not (I only get one print statement after the server starts). So whatever happens may be inside the native / loader code, I guess.
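For what it's worth, the same check can be done without editing the recipe by wrapping the stream in a small tracing generator, something like this sketch:

# Minimal sketch: log every time a task is actually pulled from a stream
def traced(stream, name="stream"):
    for i, task in enumerate(stream):
        print(f"[{name}] yielding task {i}")
        yield task

# e.g. stream = traced(stream)  # wherever the stream is built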

Prodigy versions are 1.10.2 and 1.10.4

Thanks for the detailed report and analysis! This is strange and I haven't seen that one before :thinking:

Could you run pip list and check which version of pydantic you have installed? And if it's 1.7.x, could you downgrade to v1.6.x and try again? Also, this is kind of random and might not have an impact, but it's the one setting that affects the code paths for the streams: could you try setting "feed_overlap": false and see what happens?

For tonight, I can confirm that downgrading pydantic to <1.7.0 does not change the behavior. I'll report back later with more.

Thanks.

So I've tried a mix of pydantic<1.7.0, Python 3.6 or 3.8, and feed_overlap true or false, and I still see the same behavior every time.

Thanks for checking! (My initial theory was that maybe the new pydantic version was performing more checks of data structures that we previously didn't validate, and that there might be one check we forgot to put behind an if validate: condition. But that doesn't seem to be it.)

From what you describe, it does sound like the bottleneck here is the texts being processed by the model – that's kinda the only significant thing that happens in the stream. Can you reproduce the same behaviour if you process your texts directly with spaCy?

# Variant 1
for doc in nlp.pipe(your_texts):
    print(doc.ents)

# Variant 2
for text in your_texts:
    doc = nlp(text)
    print(doc.ents)

One possible explanation would be that there's maybe just one single example that's significantly longer than all of the other ones and trips up the model.

Thanks.
Trying that here:

import json

with open("work_sessions/batch.jsonl") as f:
    lines = f.readlines()
lines = [json.loads(line) for line in lines]
lines = [task["text"] for task in lines]

So I have loaded my JSON. My Python interpreter is at 7 MB.

>>> import spacy
>>> nlp = spacy.load("target")

Loading the model: 49 MB of RSS.
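(The RSS figures I'm quoting can be read from inside the interpreter with something like this sketch; psutil is an extra install:)

import os
import psutil

def rss_mb():
    # Resident set size of the current process, in megabytes
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

print(f"RSS: {rss_mb():.0f} MB")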

# Variant 1
for doc in nlp.pipe(lines):
    print(doc.ents)

This creates a big memory peak.

Variant 2 does not (stable at around 70 MB).

>>> max(len(line) for line in lines)
11006

The maximum text length is about 11,000 characters; the maximum length in tokens is 1828.

>>> max(len(doc) for doc in map(nlp, lines))
1828

But you are right that a single example is bigger than the others. The next biggest is at 8k, and there are only 25 examples above 5k characters.
Removing all texts > 8k characters still yields a 900 MB usage.

Cutting at 5k still yields 750 MB of usage.
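The cut-offs are just a simple character-length filter, roughly like this:

# Keep only the texts below a character threshold, then re-run variant 1
MAX_CHARS = 5000  # 8000 for the first experiment
short_lines = [text for text in lines if len(text) <= MAX_CHARS]

for doc in nlp.pipe(short_lines):
    print(doc.ents)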

It still feels like a lot. And my understanding was that Prodigy does not buffer all texts/documents, but only mini-batches them.

If I take the 30 longest texts and pipe them all at once, I do not see such a peak (it climbs to 250 MB for that slice).

Correcting myself: I see that the ner.correct recipe uses nlp.pipe without specifying a batch size, which means batches of the default 1000 tasks get processed by the task generator. If I modify it to 100, the problem disappears.
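In other words, the peak goes away with an explicit, smaller batch size on the nlp.pipe call, e.g.:

# Same as variant 1, but with an explicit, smaller batch size
for doc in nlp.pipe(lines, batch_size=100):
    print(doc.ents)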

So there is a 1000-task buffer at the Prodigy level, and the memory usage is bounded by the cost of processing those 1000 tasks at once.

IMO, this is a bit heavy for an annotation UI.

Of course, most of the time there will only be short, sentence-segmented texts, and 1000 of them is not that much.

But for some use cases, or when the language does not offer sentence segmentation, this RAM usage and the incurred startup time matter.

This is a tradeoff, I understand.

As an annotator using the UI, I could totally accept a 10 to 100 ms delay every 10 or 100 tasks (waiting for an nlp.pipe() batch to complete), instead of waiting 5 seconds at launch for 1000 tasks that I will, most of the time, not annotate in a row.

That, and the 1 GB memory usage.
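Something along these lines is what I have in mind - just a sketch with my own helper name, processing the stream in small slices so only one slice of documents is held in memory at a time:

from itertools import islice

def pipe_in_small_batches(nlp, texts, batch_size=10):
    # Pull a small slice from the (possibly endless) stream, run the
    # model over it, yield the docs, then move on to the next slice.
    texts = iter(texts)
    while True:
        batch = list(islice(texts, batch_size))
        if not batch:
            break
        for doc in nlp.pipe(batch, batch_size=batch_size):
            yield doc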

Ahhh, that makes sense, thanks for getting to the bottom of this! Prodigy typically sets a default batch size here that's much lower, but it looks like in this case the recipe doesn't. I'll update this for the next release! (I'm surprised that this has never come up before!)

You're definitely right that it's always a trade-off and we may not find the one setting that's perfect for every use case, but I'm glad we found the solution and at least it's something that's pretty easy to customise if users need it :slightly_smiling_face:

Update: Just released v1.10.5, which sets a default batch size of 10 on the relevant calls to nlp.pipe in the recipes :slightly_smiling_face:

That's great! Thanks.