Saving custom tokenizer

The easiest way is to set up your custom tokenizer globally, right before you load the model.

If you want to ship your overrides and custom logic with the model, you can wrap it as a Python package. The command below sets up the package directory accordingly and adds a setup.py and __init__.py for your package. My comments on this thread go into a little more detail.

python -m spacy package /your_model /output_dir

The model data directory will include an __init__.py with a load() method, which is executed when you load the installed model from the package. You can modify the __init__.py to include your custom modifications, factory overrides or pipeline components. Running python setup.py sdist in the package directory will create an installable .tar.gz archive in a dist directory, which you can then install:

pip install dist/your_model-0.0.0.tar.gz
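
To make the __init__.py modification more concrete, here is a rough sketch of what an edited file could look like. It assumes a spaCy version that provides spacy.util.load_model_from_init_py and spacy.util.compile_infix_regex; the generated __init__.py differs between spaCy versions, so check yours before copying anything. EXTRA_INFIX and add_custom_infixes are hypothetical names I made up for illustration, and the underscore rule is just a stand-in for whatever tokenizer customization you actually need.

```python
# Sketch of a modified __init__.py for a packaged spaCy model.
# EXTRA_INFIX and add_custom_infixes are hypothetical names, not part of spaCy.

# Hypothetical rule: split on an underscore between two digits, e.g. "12_34".
EXTRA_INFIX = r"(?<=[0-9])_(?=[0-9])"


def add_custom_infixes(nlp):
    """Extend the loaded pipeline's tokenizer with one extra infix pattern."""
    # Assumption: compile_infix_regex is available in the installed spaCy.
    from spacy.util import compile_infix_regex

    infixes = list(nlp.Defaults.infixes) + [EXTRA_INFIX]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
    return nlp


def load(**overrides):
    """Load the packaged model data, then apply the custom tokenizer rule."""
    # Imported inside the function so that spaCy is only needed once
    # load() is actually called.
    from spacy.util import load_model_from_init_py

    nlp = load_model_from_init_py(__file__, **overrides)
    return add_custom_infixes(nlp)
```

The point of doing it this way is that the customization lives inside load(), so anyone who installs the package and loads the model gets the modified tokenizer without any extra setup.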

If your model package includes custom code, it’s important to always install the package, and not load only the model data from the data directory. (Otherwise, spaCy won’t execute the Python package and only consult the model’s meta.json.)
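
As a small illustration of loading through the installed package, the sketch below imports the package by name and calls its load() method, which is what runs your custom __init__.py code. The package name your_model is the placeholder from the commands above, and load_installed_model is a hypothetical helper, not a spaCy function.

```python
import importlib


def load_installed_model(package_name="your_model"):
    """Import the installed model package and call its load() method.

    Importing the package executes its __init__.py, so any custom code
    placed there runs before the pipeline is returned.
    """
    # Raises ModuleNotFoundError if the package was never pip-installed.
    module = importlib.import_module(package_name)
    return module.load()
```

If this raises ModuleNotFoundError, the package is not installed in the current environment, and loading the data directory directly would skip your custom code.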