Lightweight version of spaCy for inference?


I've trained a custom NER model that works pretty well and that I want to deploy as an API. I've chosen to do so using AWS API Gateway + Lambda. I managed to get it all working and it's doing exactly what I need. My only concern is the cold start of the Lambda functions, where spaCy is imported and the model is loaded.

I've looked online for a "lightweight" version of spaCy made specifically for inference, or for any guidance on how to strip the library down to make the import faster, and couldn't find anything. The import alone takes something like 1.5 seconds in a 1 GB Lambda function.

I've also tried to improve the model loading time, which is something like 6s on a cold start. The model is small enough to fit in a Lambda Layer, so there is no way to put the model closer to the function. I've tried to disable some parts of the model on load to save time, with no success. I've also tried increasing the power given to the Lambda function, and that did not speed up loading either.

Any speed improvement I could get would make me happy. If you tell me there isn't really an easy way to speed things up, I'll tell the frontend teams to deal with it unless they want to pay for a server with the model loaded 24/7.

Thanks in advance,


Hi Marc!

Some of this you might have already done, but here are a few optimizations I can think of:

  1. You can try a few production practices from the docs. One of them is to include the model in your requirements.txt to avoid the download step, although you'll need to be wary of your model size going beyond AWS's layer limits.

  2. Lastly, if you don't need all the pipeline components, you can disable them, and it might improve loading speed. Of course, benchmark it first to ensure that performance is still the same:

import spacy
nlp = spacy.load("path/to/my/model", disable=["tagger", "parser"])

I'd also be a bit more careful with suggestion #2, especially if you're working with the default pipeline structure with tok2vec. I'd suggest looking at this documentation when disabling different parts of the pipeline.

Hey, thanks for the answer.

The model and the spaCy library both fit in a Lambda Layer, so there is no problem with this part.

For #2, I did try to disable some parts of the model on load to save time, with no success: the model load time was exactly the same.

Ah, this is a detail that changed between v2 and v3: disable still loads the components underneath, but just disables them, to make it easy to enable them on demand. To skip loading them entirely you can try exclude instead, but in this context you might be better off with a custom model that doesn't contain unneeded components?
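To check whether disable or exclude changes anything for a given model, it can help to time each variant separately. Here is a minimal sketch, assuming spaCy v3; time_load and compare_load_options are hypothetical helper names, and model_path and components are placeholders for your own model and component list. The first load in a process tends to pay extra one-off costs, so for cold-start numbers it is probably fairer to run each variant in a fresh process.

```python
import time
from typing import Any, Callable, Dict, List


def time_load(load_fn: Callable[..., Any], *args: Any, **kwargs: Any) -> float:
    """Return the wall-clock seconds taken by a single call to load_fn."""
    start = time.perf_counter()
    load_fn(*args, **kwargs)
    return time.perf_counter() - start


def compare_load_options(model_path: str, components: List[str]) -> Dict[str, float]:
    """Time a full load, a load with disable=, and a load with exclude=.

    Hypothetical helper: model_path and components are placeholders.
    """
    # Imported here so the library import cost is not mixed into the timings.
    import spacy

    return {
        "full": time_load(spacy.load, model_path),
        # disable= still loads the components, it only switches them off.
        "disable": time_load(spacy.load, model_path, disable=components),
        # exclude= skips loading the listed components entirely.
        "exclude": time_load(spacy.load, model_path, exclude=components),
    }
```

Calling compare_load_options("path/to/my/model", ["tagger", "parser"]) returns a dict of seconds per variant. If all three numbers are close, the time is probably going somewhere other than those components.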

But especially if the model contains vectors, it's always going to take a bit of time to load all the data. Something in the range of 5s isn't unusual for a model similar to en_core_web_lg.
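For the custom-model route mentioned above, one option is to strip the pipeline once at build time and ship only the stripped copy in the Layer. This is only a sketch, assuming spaCy v3: components_to_remove and save_stripped_pipeline are hypothetical names, and the default keep list assumes the NER component depends on a shared tok2vec, which may not match your pipeline. Whether it helps depends on where the load time actually goes.

```python
from typing import Iterable, List


def components_to_remove(pipe_names: Iterable[str], keep: Iterable[str]) -> List[str]:
    """Return the pipeline components that are not in the keep list, in order."""
    keep_set = set(keep)
    return [name for name in pipe_names if name not in keep_set]


def save_stripped_pipeline(
    src_path: str,
    dst_path: str,
    keep: Iterable[str] = ("tok2vec", "ner"),  # assumption: ner listens to tok2vec
) -> List[str]:
    """Load the full pipeline once, drop unneeded components, save the rest."""
    import spacy  # only needed at build time, not in the Lambda handler

    nlp = spacy.load(src_path)
    removed = components_to_remove(nlp.pipe_names, keep)
    for name in removed:
        nlp.remove_pipe(name)
    # The saved directory can then be loaded with spacy.load(dst_path).
    nlp.to_disk(dst_path)
    return removed
```

It is worth re-running your evaluation on the stripped copy before deploying, to confirm the NER output is unchanged.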

I didn't know about "exclude". I just tried it, with the exact same result as "disable": the model took the same time to load whether I excluded everything but NER, excluded everything including NER, or excluded nothing at all. Here are the components I knew I could exclude: ["tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]

I trained a new model from blank:fr and did not start from a pretrained one.

Is there anything else I'm missing?

Thanks in advance!