How to save a custom tokenizer

I’m trying to make this example (https://github.com/explosion/spacy/blob/master/examples/training/train_parser.py) work with a custom tokenizer like this one (https://spacy.io/usage/linguistic-features#custom-tokenizer-example). It works great, but when I try to save the model, spaCy is being mean to me and telling me: `AttributeError: 'CustomTokenizer' object has no attribute 'to_disk'`

I’ve tried to avoid saving the custom tokenizer using this line: `nlp.to_disk(output_dir, disable=['tokenizer'])` – but when I try to load it back, spaCy complains again and tells me: `FileNotFoundError: [Errno 2] No such file or directory: 'modelname/tokenizer'`

What’s the right way?
I think spaCy should avoid trying to read the tokenizer back in if the directory is missing – so it seems there’s a bug there. But to fix your issue, I think it’s best to add `to_disk` and `from_disk` methods to your tokenizer:

```python
import pickle

class CustomTokenizer(object):
    def to_bytes(self):
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())
```
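As a sanity check, these serialization methods can be exercised in isolation, without spaCy. Here is a minimal stand-in version of the class – the `prefix` attribute is an invented piece of state just for the demo, standing in for whatever settings your real tokenizer carries:

```python
import os
import pickle
import tempfile

class CustomTokenizer(object):
    def __init__(self, prefix="x"):
        self.prefix = prefix  # invented state, stands in for real tokenizer settings

    def to_bytes(self):
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

# Round-trip: save one tokenizer's state and restore it into a fresh instance
path = os.path.join(tempfile.mkdtemp(), "tokenizer")
CustomTokenizer(prefix="hello").to_disk(path)
restored = CustomTokenizer()
restored.from_disk(path)
print(restored.prefix)  # hello
```

Pickling `self.__dict__` keeps the approach generic: any new attribute you add to the tokenizer is saved and restored automatically.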
You’ll also want to override the `Language.factories['tokenizer']` entry, so that the `Language` class knows to refer to your custom class. You could subclass `English`, or just write to the class attribute directly:

```python
Language.factories['tokenizer'] = CustomTokenizer
```
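Why does assigning the class itself work? A factories table is just a mapping from component names to callables that build the component, and a class is callable. The sketch below illustrates the pattern only – it is not spaCy’s actual internals:

```python
# Schematic illustration of a factory registry (not spaCy's real code):
# names map to callables, and calling the entry builds the component.
factories = {}

class CustomTokenizer(object):
    def __init__(self, vocab=None):
        self.vocab = vocab

    def __call__(self, text):
        # Naive whitespace split, purely for illustration
        return text.split()

# Registering the class works because classes are callable
factories['tokenizer'] = CustomTokenizer

tokenizer = factories['tokenizer'](None)
print(tokenizer("hello world"))  # ['hello', 'world']
```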
Where do you actually do this overriding? For example, my uncompiled model has directories like `vocab` and files like `training.jsonl`. Do I create a new file to override the language factory in?
The easiest way is to do this globally, right before you load the model.
If you want to include your overrides and custom logic with the model, you can wrap it as a Python package, which will set up the directory accordingly and add an `__init__.py` for your package. My comments on this thread go into a little more detail here.

```
python -m spacy package /your_model /output_dir
```
The model data directory will include an `__init__.py` with a `load` method, which is executed if you load the installed model from the package. You can modify the `__init__.py` to include your custom modifications, factory overrides or pipeline components. Running `python setup.py sdist` in the package directory will create an installable `.tar.gz` archive in a `dist` directory:

```
pip install dist/your_model-0.0.0.tar.gz
```

If your model package includes custom code, it’s important to always install the package, and not load only the model data from the data directory. (Otherwise, spaCy won’t execute your Python package code and will only read the raw model data.)
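A minimal sketch of what such a modified `__init__.py` might look like. `load_model_from_init_py` is spaCy’s standard entry point for packaged models; the module name `custom_tokenizer` and the override lines are assumptions based on the fixes discussed in this thread, not generated by `spacy package`:

```python
# __init__.py of the packaged model (sketch, not runnable standalone)
from spacy.language import Language
from spacy.util import load_model_from_init_py

from .custom_tokenizer import CustomTokenizer  # hypothetical module in your package

def create_tokenizer(nlp):
    return CustomTokenizer(nlp)

def load(**overrides):
    # Apply the overrides before the model data is read back in
    Language.Defaults.create_tokenizer = create_tokenizer
    Language.factories['tokenizer'] = CustomTokenizer
    return load_model_from_init_py(__file__, **overrides)
```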
Thanks for the reply. When I add the `from_disk` method, I get this error:

```
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 595, in
    ('tokenizer', lambda p: self.tokenizer.to_disk(p, vocab=False)),
TypeError: to_disk() got an unexpected keyword argument 'vocab'
```
Ah, it looks like spaCy actually calls the tokenizer’s `to_disk` method with the keyword argument `vocab` – so it complains here, because your custom function doesn’t accept that (or any other) keyword arguments. To be safe, you could just do something like this:

```python
def from_disk(self, path, **kwargs):
```
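Applied to the serialization methods from earlier, accepting (and ignoring) arbitrary keyword arguments on all four methods makes them compatible with spaCy’s `vocab=False` calls – a sketch:

```python
import os
import pickle
import tempfile

class CustomTokenizer(object):
    def to_bytes(self, **kwargs):
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data, **kwargs):
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path, **kwargs):
        # spaCy calls this as to_disk(path, vocab=False); extra kwargs are ignored
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path, **kwargs):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

# The call spaCy makes no longer raises a TypeError:
path = os.path.join(tempfile.mkdtemp(), "tokenizer")
CustomTokenizer().to_disk(path, vocab=False)
print(os.path.exists(path))  # True
```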
Great! Model saved. Now when I try to load it again, using `nlp2 = spacy.load(output_dir)`, I get the error below.
```
File "testspacy.py", line 196, in main
    nlp2 = spacy.load(output_dir)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
    return load_model_from_path(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 159, in load_model_from_path
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 638, in from_disk
    util.from_disk(path, deserializers, exclude)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 522, in from_disk
    reader(path / key)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 626, in
    ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 371, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 406, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 501, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/me/anaconda/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2717)
msgpack.exceptions.ExtraData: unpack(b) received extra data.
```
Is it because I’m not supposed to use spacy.load ? Am I supposed to turn the model into a package instead ?
It looks like spaCy is actually initialising its own tokenizer (`spacy.tokenizer.Tokenizer`) and then calling its `from_disk` method on load – which fails – instead of using your custom tokenizer. Could you try adding the following before you load the model:

```python
def create_tokenizer(nlp):
    return CustomTokenizer(nlp)  # or however your custom tokenizer is initialised

Language.Defaults.create_tokenizer = create_tokenizer
```
Because the tokenizer is “special” and not just a pipeline component, adding it to the factories as suggested by @honnibal above might not be enough. (This isn’t ideal behaviour – spaCy should probably always refer to the factory here, just like it does for the other components.)
Perfect! It worked.