Saving custom tokenizer

Hi,
I'm trying to make this example (https://github.com/explosion/spacy/blob/master/examples/training/train_parser.py) work with a custom tokenizer like this one (https://spacy.io/usage/linguistic-features#custom-tokenizer-example). It works great, but when I try to save the model, spaCy is being mean to me and tells me: AttributeError: 'CustomTokenizer' object has no attribute 'to_disk'
I've tried to avoid saving the custom tokenizer using this line: nlp.to_disk(output_dir, disable=['tokenizer']), but when I try to load it back, spaCy complains again and tells me: FileNotFoundError: [Errno 2] No such file or directory: 'modelname/tokenizer'
What's the right way?

I think spaCy should avoid trying to read the tokenizer back in if the directory is missing, so it seems there's a bug there. But to fix your issue, the best approach is probably to add to_disk and from_disk methods to your tokenizer:


import pickle


class CustomTokenizer(object):
    def to_bytes(self):
        # Serialize the tokenizer's attributes
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        # Restore the attributes saved by to_bytes
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

You’ll also want to override the Language.factories['tokenizer'] entry, so that the Language class knows to refer to your custom class. You could subclass English, or just write to the class attribute directly:


from spacy.language import Language
Language.factories['tokenizer'] = CustomTokenizer

Where do you actually do this overriding? For example my uncompiled model has directories parser, tagger, textcat, vocab and files meta.json, evaluation.jsonl, tokenizer, and training.jsonl. Do I create a new file to override the language factory in?

The easiest way is to do this globally, right before you load the model.

If you want to include your overrides and custom logic with the model, you can wrap it as a Python package. The command below sets up the directory accordingly and adds a setup.py and __init__.py for your package. My comments on this thread go into a little more detail here.

python -m spacy package /your_model /output_dir

The model data directory will include an __init__.py and a load method, which is executed if you load the installed model from the package. You can modify the __init__.py to include your custom modifications, factory overrides or pipeline components. Running python setup.py sdist in the package directory will create an installable .tar.gz archive in a directory dist:

pip install dist/your_model-0.0.0.tar.gz

If your model package includes custom code, it’s important to always install the package, and not load only the model data from the data directory. (Otherwise, spaCy won’t execute the Python package and only consult the model’s meta.json.)

Thanks for the reply. When I add these methods, I get this error:

File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 595, in
('tokenizer', lambda p: self.tokenizer.to_disk(p, vocab=False)),
TypeError: to_disk() got an unexpected keyword argument 'vocab'

Ah, it looks like spaCy actually calls the tokenizer's to_disk method with the keyword argument vocab, so it complains here because your custom method doesn't accept that (or any other) keyword argument. The same goes for from_disk when loading. To be safe, you could let both methods accept arbitrary keyword arguments, like this:

def to_disk(self, path, **kwargs):
def from_disk(self, path, **kwargs):
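
Putting the pieces from this thread together, a minimal sketch of a tokenizer whose serialization methods tolerate extra keyword arguments might look like the following. The `lowercase` attribute and the whitespace splitting are placeholders invented for illustration, not part of anyone's actual tokenizer; a real spaCy tokenizer would return a Doc rather than a list.

```python
import os
import pickle
import tempfile


class CustomTokenizer(object):
    """Toy tokenizer; only the serialization methods matter here."""

    def __init__(self, lowercase=False):
        # Hypothetical setting, used only to have some state to save.
        self.lowercase = lowercase

    def __call__(self, text):
        # Placeholder logic: a real spaCy tokenizer would return a Doc.
        words = text.split()
        return [w.lower() for w in words] if self.lowercase else words

    def to_bytes(self, **kwargs):
        # Extra keyword arguments (such as vocab=False) are accepted and ignored.
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data, **kwargs):
        self.__dict__.update(pickle.loads(data))
        return self

    def to_disk(self, path, **kwargs):
        with open(str(path), "wb") as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path, **kwargs):
        with open(str(path), "rb") as file_:
            self.from_bytes(file_.read())
        return self


# Round trip through a temporary file, passing vocab=False as in the traceback.
tmp_path = os.path.join(tempfile.mkdtemp(), "tokenizer")
CustomTokenizer(lowercase=True).to_disk(tmp_path, vocab=False)
restored = CustomTokenizer().from_disk(tmp_path, vocab=False)
```

Since this relies on pickle, it is only safe to load tokenizer files that you created yourself.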

Great! Model saved. Now when I try to load it again, using

nlp2 = spacy.load(output_dir)

I get the error below.

File "testspacy.py", line 196, in main
nlp2 = spacy.load(output_dir)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/__init__.py", line 19, in load
return util.load_model(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
return load_model_from_path(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 159, in load_model_from_path
return nlp.from_disk(model_path)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 638, in from_disk
util.from_disk(path, deserializers, exclude)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 522, in from_disk
reader(path / key)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 626, in
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 371, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 406, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 501, in from_bytes
msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/me/anaconda/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2717)
msgpack.exceptions.ExtraData: unpack(b) received extra data.

Is it because I'm not supposed to use spacy.load? Am I supposed to turn the model into a package instead?

It looks like spaCy is actually initialising its own tokenizer (spacy.tokenizer.Tokenizer) and then calling its from_disk method on load, which fails – instead of your custom tokenizer.

Could you try adding the following before you load the model:

from spacy.language import Language

def create_tokenizer(nlp):
    return CustomTokenizer(nlp)  # or however your custom tokenizer is initialised

# Assign the factory function, not the class itself
Language.Defaults.create_tokenizer = create_tokenizer

Because the tokenizer is "special" and not just a pipeline component, adding it to the factories as suggested by @honnibal above might not be enough. (This isn't ideal behaviour – spaCy should probably always refer to the factory here, just like it does for the other components.)
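
To see why the override has to be in place before the model is loaded, here is a toy model of the mechanism described above. `ToyLanguage`, `ToyDefaults`, `BuiltinTokenizer` and this `CustomTokenizer` are hypothetical stand-ins written for illustration, not spaCy's real classes. The only assumption carried over from the thread is that the tokenizer is created through a class-level `create_tokenizer` hook when the `nlp` object is constructed.

```python
class BuiltinTokenizer(object):
    def __init__(self, nlp):
        self.nlp = nlp


class CustomTokenizer(object):
    def __init__(self, nlp):
        self.nlp = nlp


class ToyDefaults(object):
    # Stand-in for a class-level hook that builds the tokenizer.
    @staticmethod
    def create_tokenizer(nlp):
        return BuiltinTokenizer(nlp)


class ToyLanguage(object):
    Defaults = ToyDefaults

    def __init__(self):
        # The hook is looked up at construction time, so whatever is
        # assigned to it beforehand decides which tokenizer is created.
        self.tokenizer = self.Defaults.create_tokenizer(self)


def create_tokenizer(nlp):
    return CustomTokenizer(nlp)


# Constructed before the override: gets the built-in tokenizer.
before = ToyLanguage()

# Assign the factory function, then construct a new object.
ToyLanguage.Defaults.create_tokenizer = create_tokenizer
after = ToyLanguage()
```

In this toy setup, any deserialization step that later calls `from_disk` on `after.tokenizer` would reach the custom class, while `before.tokenizer` would still be the built-in one.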

Perfect! It worked.

Thanks!


I'm wondering how to save the custom tokenizer so that Prodigy will use it.

Prodigy has the same errors as above:

➜ prodigy ner.manual rules_lines lines_lang_model data/rules_for_prodigy.json
/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
Traceback (most recent call last):
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/plac_core.py", line 328, in __call__
cmd, result = parser.consume(arglist)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 146, in manual
nlp = spacy.load(spacy_model)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/__init__.py", line 17, in load
return util.load_model(name, **overrides)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
return load_model_from_path(Path(name), **overrides)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
return nlp.from_disk(model_path)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
util.from_disk(path, deserializers, exclude)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
reader(path / key)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/language.py", line 642, in
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 367, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 402, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 490, in from_bytes
msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.

@Dim You might find it easiest just to write your own load function that sets up the nlp object how you like it. Then you can use this function in your recipes.

Another option is to use the spacy.util.set_lang_class function. This lets you register a loader that maps a language key to a class or function that returns a Language object. For instance, the entry en is mapped to the English class. You could map the string "custom_loader" to a function my_load_function(), so long as that function returns a Language instance.

I assigned a CustomTokenizer to Language.Defaults.create_tokenizer the wrong way and sort of broke everything. How do I get back to the initial state?

Where did you assign that? Can't you just restart your program?

I did, and at this point everything is okay. But do I have to assign my tokenizer like that every time before I load a model? That makes it impossible to use prodigy train ner from the command line, as the models will always be loaded with the default tokenizer. Or am I misunderstanding something?
For annotation I use a custom recipe that assigns a custom tokenizer to the model, but how do I train with the same tokenizer, and save a model that will be loaded back with my tokenizer?
For now, all I could do was convert with gold-to-spacy and train with plain spaCy in notebooks, since that way I can specify the tokenizer myself. Please let me know if I've got this completely wrong.

If your tokenizer needs custom code, you can package your spaCy model as a Python package and include the tokenizer code with the package. See here for details on packaging models: Saving and Loading · spaCy Usage Documentation

The package will then have an __init__.py with a load() function that's in charge of putting together the nlp object. That's also what spaCy calls under the hood when you load a model from a package. So you can edit that and include any other setup logic there, like writing to nlp.tokenizer or the Language class. You can then install your model package in your environment and your custom code will be executed on load.

Thank you! I'll try it.


Hi!
I did everything you said, up to changing the load() function. I'm not sure I did it correctly, because what I've done doesn't work: the model still doesn't tokenize the way my custom tokenizer does.

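from pathlib import Path

from spacy.util import load_model_from_init_py, get_model_meta

from CustomFunctions.ctokenizer import CTokenizer


__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    loaded = load_model_from_init_py(__file__, **overrides)
    # Replace the default tokenizer with the custom one
    loaded.tokenizer = CTokenizer(loaded.vocab, False)
    return loaded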

How are you running and loading your model? If you have custom code in your __init__.py, the model has to be installed and executed as a package, because otherwise, the code in the load method doesn't get executed and you end up with the correct data but without any of the overrides.

I'm not sure I understand the question well enough to answer correctly. I have installed it as a package with this code placed into __init__.py, but when I execute

nlp = spacy.load(my_model)
nlp('sometext')

the result is "sometext" instead of "some text"

Yes, that's what I meant – so you're definitely loading the model from a Python package, and you've verified that the right model is loaded and that your custom load function is executed and the tokenizer is replaced? (If you want to be extra sure, you could also add a print statement.)
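
One way to do that check is a small helper that reports which tokenizer class is attached to a loaded object. This is only a sketch: `tokenizer_info` is a hypothetical helper name, and the stand-in objects below merely mimic an `nlp` object with a `tokenizer` attribute so that the snippet runs without a model installed.

```python
from types import SimpleNamespace


def tokenizer_info(nlp):
    """Return 'module.ClassName' for the tokenizer attached to nlp."""
    cls = type(nlp.tokenizer)
    return "{}.{}".format(cls.__module__, cls.__name__)


class CTokenizer(object):
    """Stand-in for the custom tokenizer class from the post above."""


# Stand-in for a loaded model; with a real model you would pass the
# object returned by your package's load() function instead.
fake_nlp = SimpleNamespace(tokenizer=CTokenizer())
print(tokenizer_info(fake_nlp))
```

If the reported class is spacy.tokenizer.Tokenizer rather than your own, the custom load() function was most likely not executed.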