Saving custom tokenizer

spacy
solved

#1

Hi,
I’m trying to make this example (https://github.com/explosion/spacy/blob/master/examples/training/train_parser.py) work with a custom tokenizer like this one (https://spacy.io/usage/linguistic-features#custom-tokenizer-example). Works great, but when I try to save the model, spacy is being mean to me and telling me AttributeError: ‘CustomTokenizer’ object has no attribute ‘to_disk’ :frowning:
I’ve tried to avoid saving the custom tokenizer using this line : nlp.to_disk(output_dir, disable=[‘tokenizer’]), but when i try to load it back, spacy complains agains and tells me : FileNotFoundError: [Errno 2] No such file or directory: ‘modelname/tokenizer’
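For reference, my tokenizer is essentially the one from that docs example (a simple whitespace tokenizer), minus any serialization methods – roughly:

from spacy.tokens import Doc

class CustomTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # in this simple scheme, every token is followed by a space
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)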
What’s the right way to do this?


(Matthew Honnibal) #2

I think spaCy should avoid trying to read the tokenizer back in if the directory is missing – so it seems there’s a bug there. But to fix your issue, I think the best approach is to add to_disk and from_disk methods to your tokenizer:


import pickle

class CustomTokenizer(object):
    def to_bytes(self):
        # serialize the tokenizer's attributes
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        # restore the attributes from the pickled state
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

You’ll also want to override the Language.factories['tokenizer'] entry, so that the Language class knows to refer to your custom class. You could subclass English, or just write to the class attribute directly:


Language.factories['tokenizer'] = CustomTokenizer
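For instance, the subclassing route could look roughly like this (a sketch – the factories dict is copied so the base Language class stays untouched):

from spacy.lang.en import English

class CustomEnglish(English):
    # use a copy of the shared factories dict so the override stays local
    factories = dict(English.factories)

CustomEnglish.factories['tokenizer'] = CustomTokenizer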

(Kevin) #3

Where do you actually do this overriding? For example, my unpackaged model has the directories parser, tagger, textcat and vocab, plus the files meta.json, evaluation.jsonl, tokenizer and training.jsonl. Do I create a new file in which to override the language factory?


(Ines Montani) #4

The easiest way is to do this globally, right before you load the model.

If you want to include your overrides and custom logic with the model, you can wrap it as a Python package, which will set up the directory accordingly, and add a setup.py and __init__.py for your package. My comments on this thread go into a little more detail here.

python -m spacy package /your_model /output_dir

The model package directory will include an __init__.py with a load() method, which is executed when you load the installed model from the package. You can modify the __init__.py to include your custom modifications, factory overrides or pipeline components. Running python setup.py sdist in the package directory will create an installable .tar.gz archive in the dist directory:

pip install dist/your_model-0.0.0.tar.gz

If your model package includes custom code, it’s important to always install the package, and not just load the model data from the data directory. (Otherwise, spaCy won’t execute the Python package and will only consult the model’s meta.json.)
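For example, the generated __init__.py could be extended roughly like this (a sketch – my_tokenizer is a hypothetical module you’d ship inside the package):

from pathlib import Path
from spacy.language import Language
from spacy.util import load_model_from_init_py, get_model_meta

from .my_tokenizer import CustomTokenizer  # hypothetical module name

__version__ = get_model_meta(Path(__file__).parent)['version']

# register the override before the model data is deserialized
Language.factories['tokenizer'] = CustomTokenizer

def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)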


#5

Thanks for the reply. When I add the from_disk method, I get this error:

File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 595, in
('tokenizer', lambda p: self.tokenizer.to_disk(p, vocab=False)),
TypeError: to_disk() got an unexpected keyword argument 'vocab'


(Ines Montani) #6

Ah, it looks like spaCy actually calls the tokenizer’s to_disk and from_disk methods with the keyword argument vocab – so it complains here, because your custom methods don’t accept that (or any other) keyword arguments. To be safe, you could just accept arbitrary keyword arguments in both methods:

def to_disk(self, path, **kwargs):
def from_disk(self, path, **kwargs):

#7

Great! Model saved. Now when I try to load it again, using

nlp2 = spacy.load(output_dir)

I get the error below.

File "testspacy.py", line 196, in main
nlp2 = spacy.load(output_dir)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/__init__.py", line 19, in load
return util.load_model(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
return load_model_from_path(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 159, in load_model_from_path
return nlp.from_disk(model_path)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 638, in from_disk
util.from_disk(path, deserializers, exclude)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 522, in from_disk
reader(path / key)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 626, in
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 371, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 406, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 501, in from_bytes
msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/me/anaconda/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2717)
msgpack.exceptions.ExtraData: unpack(b) received extra data.

Is it because I’m not supposed to use spacy.load? Am I supposed to turn the model into a package instead?


(Ines Montani) #8

It looks like spaCy is initialising its own tokenizer (spacy.tokenizer.Tokenizer) on load and then calling that tokenizer’s from_disk method, instead of using your custom tokenizer – which is why it fails.

Could you try adding the following before you load the model:

def create_tokenizer(nlp):
    return CustomTokenizer(nlp)  # or however your custom tokenizer is initialised

Language.Defaults.create_tokenizer = create_tokenizer

Because the tokenizer is “special” and not just a pipeline component, adding it to the factories as suggested by @honnibal above might not be enough. (This isn’t ideal behaviour – spaCy should probably always refer to the factory here, just like it does for the other components.)
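Putting the thread together, a minimal end-to-end sketch might look like this (the whitespace tokenizer and the /tmp path are assumptions – adjust the constructor and state handling to match your class):

import pickle
import spacy
from spacy.language import Language
from spacy.tokens import Doc

class CustomTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        return Doc(self.vocab, words=words, spaces=[True] * len(words))

    def to_bytes(self, **kwargs):
        # leave the shared vocab out of the pickled state
        state = {k: v for k, v in self.__dict__.items() if k != 'vocab'}
        return pickle.dumps(state)

    def from_bytes(self, data, **kwargs):
        self.__dict__.update(pickle.loads(data))
        return self

    def to_disk(self, path, **kwargs):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path, **kwargs):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())
        return self

def create_tokenizer(nlp):
    return CustomTokenizer(nlp.vocab)

Language.Defaults.create_tokenizer = create_tokenizer

nlp = spacy.blank('en')            # picks up the custom tokenizer via Defaults
nlp.to_disk('/tmp/custom_model')   # hypothetical path

nlp2 = spacy.load('/tmp/custom_model')  # loads with the custom tokenizer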


#9

Perfect! It worked.

Thanks!