Saving custom tokenizer

tba · March 13, 2018, 10:21am

Hi,
I’m trying to make this example (https://github.com/explosion/spacy/blob/master/examples/training/train_parser.py) work with a custom tokenizer like this one (https://spacy.io/usage/linguistic-features#custom-tokenizer-example). It works great, but when I try to save the model, spaCy is being mean to me and tells me: AttributeError: 'CustomTokenizer' object has no attribute 'to_disk'
I’ve tried to avoid saving the custom tokenizer using this line: nlp.to_disk(output_dir, disable=['tokenizer']), but when I try to load it back, spaCy complains again and tells me: FileNotFoundError: [Errno 2] No such file or directory: 'modelname/tokenizer'
What’s the right way?

honnibal · March 14, 2018, 2:28pm

I think spaCy should avoid trying to read the tokenizer back in if the directory is missing – so it seems there’s a bug there. But to fix your issue, I think it should be best to add to_disk and from_disk methods to your tokenizer:


import pickle


class CustomTokenizer(object):
    def to_bytes(self):
        # Serialize the tokenizer's attributes with pickle.
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        # Restore the attributes from pickled bytes.
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

You’ll also want to override the Language.factories['tokenizer'] entry, so that the Language class knows to refer to your custom class. You could subclass English, or just write to the class attribute directly:


Language.factories['tokenizer'] = CustomTokenizer

KevinJ90825 · March 14, 2018, 4:25pm

Where do you actually do this overriding? For example my uncompiled model has directories parser, tagger, textcat, vocab and files meta.json, evaluation.jsonl, tokenizer, and training.jsonl. Do I create a new file to override the language factory in?

ines · March 14, 2018, 4:32pm

The easiest way is to do this globally, right before you’re loading the model.

If you want to include your overrides and custom logic with the model, you can wrap it as a Python package, which will set up the directory accordingly, and add a setup.py and __init__.py for your package. My comments on this thread go into a little more detail here.

python -m spacy package /your_model /output_dir

The model data directory will include an __init__.py and a load method, which is executed if you load the installed model from the package. You can modify the __init__.py to include your custom modifications, factory overrides or pipeline components. Running python setup.py sdist in the package directory will create an installable .tar.gz archive in a directory dist:

pip install dist/your_model-0.0.0.tar.gz

If your model package includes custom code, it’s important to always install the package, and not load only the model data from the data directory. (Otherwise, spaCy won’t execute the Python package and only consult the model’s meta.json.)

tba · March 14, 2018, 8:27pm

Thanks for the reply. When I add the from_disk method, I get this error:

File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 595, in <lambda>
('tokenizer', lambda p: self.tokenizer.to_disk(p, vocab=False)),
TypeError: to_disk() got an unexpected keyword argument 'vocab'

ines · March 14, 2018, 8:46pm

Ah, it looks like spaCy actually calls the tokenizer’s to_disk method with the keyword argument vocab, so it complains here because your custom method doesn’t accept that (or any other) keyword argument. from_disk is called the same way. To be safe, you could let both methods accept arbitrary keyword arguments:

def to_disk(self, path, **kwargs):
def from_disk(self, path, **kwargs):
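For reference, here is a minimal, standard-library-only sketch of a tokenizer whose serialization methods tolerate extra keyword arguments such as vocab=False. The whitespace splitting, the lowercase option and the roundtrip helper are assumptions made up for illustration; a real spaCy tokenizer would return a Doc rather than a list of strings.

```python
import pickle
import tempfile
from pathlib import Path


class CustomTokenizer:
    """Toy whitespace tokenizer (hypothetical); returns a list of strings."""

    def __init__(self, lowercase=False):
        self.lowercase = lowercase

    def __call__(self, text):
        words = text.split()
        return [w.lower() for w in words] if self.lowercase else words

    def to_bytes(self, **kwargs):
        # Pickle the instance attributes; extra keyword arguments are ignored.
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data, **kwargs):
        self.__dict__.update(pickle.loads(data))
        return self

    def to_disk(self, path, **kwargs):
        # Accepts and ignores arguments such as vocab=False.
        with open(path, "wb") as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path, **kwargs):
        with open(path, "rb") as file_:
            self.from_bytes(file_.read())
        return self


def roundtrip(tokenizer):
    """Save the tokenizer to a temporary file and load it into a new instance."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "tokenizer"
        tokenizer.to_disk(path, vocab=False)
        return CustomTokenizer().from_disk(path, vocab=False)
```

Since this relies on pickle, it should only be used to load tokenizer files that come from a trusted source.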

tba · March 14, 2018, 9:11pm

Great ! Model saved. Now when I try to load it again, using

nlp2 = spacy.load(output_dir)

I get the error below.

File "testspacy.py", line 196, in main
nlp2 = spacy.load(output_dir)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/__init__.py", line 19, in load
return util.load_model(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
return load_model_from_path(name, **overrides)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 159, in load_model_from_path
return nlp.from_disk(model_path)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 638, in from_disk
util.from_disk(path, deserializers, exclude)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 522, in from_disk
reader(path / key)
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 626, in <lambda>
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 371, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 406, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/util.py", line 501, in from_bytes
msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/me/anaconda/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2717)
msgpack.exceptions.ExtraData: unpack(b) received extra data.

Is it because I'm not supposed to use spacy.load ? Am I supposed to turn the model into a package instead ?

ines · March 14, 2018, 9:26pm

tba:

File "/Users/me/anaconda/lib/python3.6/site-packages/spacy/language.py", line 626, in <lambda>
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 371, in spacy.tokenizer.Tokenizer.from_disk

It looks like spaCy is actually initialising its own tokenizer (spacy.tokenizer.Tokenizer) and then calling its from_disk method on load, which fails – instead of your custom tokenizer.

Could you try adding the following before you load the model:

from spacy.language import Language

def create_tokenizer(nlp):
    return CustomTokenizer(nlp)  # or however your custom tokenizer is initialised

Language.Defaults.create_tokenizer = create_tokenizer

Because the tokenizer is "special" and not just a pipeline component, adding it to the factories as suggested by @honnibal above might not be enough. (This isn't ideal behaviour – spaCy should probably always refer to the factory here, just like it does for the other components.)

tba · March 14, 2018, 9:30pm

Perfect ! It worked.

Thanks !

Dim · September 23, 2018, 9:50am

I'm wondering how to save the custom tokenizer so that prodigy would use it?

Prodigy has the same errors as above:

➜ prodigy ner.manual rules_lines lines_lang_model data/rules_for_prodigy.json
/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
Traceback (most recent call last):
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
controller = recipe(args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 167, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/plac_core.py", line 207, in consume
return cmd, self.func((args + varargs + extraopts), **kwargs)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 146, in manual
nlp = spacy.load(spacy_model)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/__init__.py", line 17, in load
return util.load_model(name, **overrides)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 116, in load_model
return load_model_from_path(Path(name), **overrides)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 156, in load_model_from_path
return nlp.from_disk(model_path)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/language.py", line 653, in from_disk
util.from_disk(path, deserializers, exclude)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 511, in from_disk
reader(path / key)
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/language.py", line 642, in <lambda>
('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
File "tokenizer.pyx", line 367, in spacy.tokenizer.Tokenizer.from_disk
File "tokenizer.pyx", line 402, in spacy.tokenizer.Tokenizer.from_bytes
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/spacy/util.py", line 490, in from_bytes
msg = msgpack.loads(bytes_data, encoding='utf8')
File "/Users/dima/miniconda3/envs/fastai-cpu/lib/python3.6/site-packages/msgpack_numpy.py", line 187, in unpackb
return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
File "msgpack/_unpacker.pyx", line 208, in msgpack._unpacker.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.

honnibal · September 24, 2018, 8:14am

@Dim You might find it easiest just to write your own load function that sets up the nlp object how you like it. Then you can use this function in your recipes.

Another option is to use the spacy.util.set_lang_class function. This lets you register a loader that maps a language key to a class or function that returns a Language object. For instance, the entry en is mapped to the English class. You could map the string "custom_loader" to a function my_load_function(), so long as that function returns a Language instance.
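The first option, a custom load function, might look roughly like the sketch below. WhitespaceTokenizer and load_nlp are hypothetical names used only for illustration, spacy.blank("en") stands in for a real model when no name is given, and the sketch assumes the model on disk can be loaded with the default tokenizer before the custom one is swapped in.

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical custom tokenizer: splits on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        # Build a Doc directly from the pre-split words.
        return Doc(self.vocab, words=words)


def load_nlp(name=None):
    """Load a model (or a blank English pipeline) and attach the custom tokenizer."""
    nlp = spacy.load(name) if name else spacy.blank("en")
    nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
    return nlp
```

A custom recipe could then call load_nlp("your_model") instead of spacy.load, so the same tokenizer is attached every time the model is loaded.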

kak-to-tak · March 19, 2020, 7:11am

I assigned a CustomTokenizer to Language.Defaults.create_tokenizer in a wrong way and sort of broke everything. How do I get back to the initial state?

ines · March 20, 2020, 9:43am

Where did you assign that? Can't you just restart your program?

kak-to-tak · March 20, 2020, 12:04pm

I did, and at this point everything is okay. But do I have to assign my tokenizer like that every time before I load a model? This makes it impossible to use prodigy train ner via the command line, as the models will always be loaded with the default tokenizer. Or am I misunderstanding something?
For annotation I use a custom recipe which assigns a custom tokenizer to a model, but how do I train with the same tokenizer, and save a model which will be loaded back with my tokenizer?
For now, all I could do was convert with gold-to-spacy and train with plain spaCy in notebooks, since that way I can specify the tokenizer myself. Please let me know if I've got this completely wrong.

ines · March 20, 2020, 8:26pm

If your tokenizer needs custom code, you can package your spaCy model as a Python package and include the tokenizer code with the package. See here for details on packaging models: Saving and Loading · spaCy Usage Documentation

The package will then have an __init__.py with a load() function that's in charge of putting together the nlp object. That's also what spaCy calls under the hood when you load a model from a package. So you can edit that and include any other setup logic there, like writing to nlp.tokenizer or the Language class. You can then install your model package in your environment and your custom code will be executed on load.

kak-to-tak · March 21, 2020, 8:19am

Thank you! I'll try it.

kak-to-tak · April 6, 2020, 3:15pm

Hi!
I did everything you said up to changing the load() function. I'm not sure I did it correctly, but what I've done doesn't work: the model still doesn't tokenize with the custom tokenizer.

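from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta

from CustomFunctions.ctokenizer import CTokenizer


__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    loaded = load_model_from_init_py(__file__, **overrides)
    # Replace the default tokenizer with the custom one after loading.
    loaded.tokenizer = CTokenizer(loaded.vocab, False)
    return loaded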

ines · April 7, 2020, 9:23am

How are you running and loading your model? If you have custom code in your __init__.py, the model has to be installed and executed as a package, because otherwise, the code in the load method doesn't get executed and you end up with the correct data but without any of the overrides.

kak-to-tak · April 7, 2020, 1:54pm

I'm not sure I understand your question well enough to answer it correctly. I have installed it as a package with this code placed in __init__.py, but when I execute

nlp = spacy.load(my_model)
nlp("sometext")

the result is "sometext" instead of "some text"

ines · April 7, 2020, 3:11pm

Yes, that's what I meant – so you're definitely loading the model from a Python package, and you've verified that the right model is loaded and that your custom load function is executed and the tokenizer is replaced? (If you want to be extra sure, you could also add a print statement.)
