How to save a custom tokenizer

Hi,
I am new to spaCy, so this is my first time using this forum.

I have a problem saving a custom tokenizer. I trained a model as described here https://spacy.io/usage/training#example-train-ner and got decent results out of the box, but to get higher accuracy I had to use a custom tokenizer as described here https://spacy.io/usage/linguistic-features#native-tokenizers .

When I try to save the trained model, it fails because it can't find a to_disk() method on the custom tokenizer. I found that changing nlp.to_disk(output_dir) to nlp.to_disk(output_dir, disable=['tokenizer']) let me save the model without the tokenizer. But when I try to load the model, it says it cannot find a tokenizer in the model folder, and I couldn't find a way to disable the tokenizer while loading.

I found a helpful thread here: '/saving-custom-tokenizer/395' (I can't add the full URL due to the link restriction for new accounts), but I can't figure out how to apply the fixes discussed there to the way I've modified my tokenizer. Can anyone please guide me on how to fix this properly?

Sorry about the link restriction – this is just a setting for new accounts, to prevent spam bots.

Could you share the code of your custom tokenizer?

The solution discussed in this thread is definitely an option, especially for more advanced custom tokenizers – but depending on your code, it might not actually be necessary to implement all of the methods from scratch. And while you can package the tokenizer with the model if you want it to be super elegant, you don’t have to, especially not in the beginning while you’re still getting familiar with spaCy’s API :slightly_smiling_face:

So for now, the most important thing is to make the model save out something (anything!) as the tokenizer, so you can save it and load it back in without problems. You can then always re-add your tokenizer afterwards, by writing to nlp.tokenizer of the loaded model:

import spacy

nlp = spacy.load('/your-custom-model')
nlp.tokenizer = your_custom_tokenizer(nlp)

doc = nlp(u"I will be processed with the custom tokenizer")

Hi ines,

Thank you for the quick reply :slightly_smiling_face: Being able to load the model I saved without the tokenizer would be enough for now, but I'm also very keen to learn how to properly save a model with a custom tokenizer :slightly_smiling_face:

This is the simple tokenizer I have created for training:

import re
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''1/|2/|:[0-9][0-9][A-K]:|:[0-9][0-9]:''')
suffix_re = re.compile(r'''''')  # no custom suffixes
infix_re = re.compile(r'''[~]''')

def my_tokenizer_pri(nlp):
    return Tokenizer(nlp.vocab,
                     {},  # no special-case rules
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

I am just using this to replace the tokenizer while training:

nlp.tokenizer = my_tokenizer_pri(nlp)

The only problem is that I have no clue how to save this tokenizer. I was able to save the model without the tokenizer, but when I try to load it, it still looks for a tokenizer in the model folder. Please tell me all possible ways of fixing this issue.

Thanks for sharing your code! I think you might have actually come across a bug or a bad example in the docs – sorry about that :woman_facepalming: It turns out that the tokenizer also expects a value for the token_match setting (which can be used in some languages to define more complex rules for tokens). In your custom function, try the following – this will use the default value:

def my_tokenizer_pri(nlp):
    return Tokenizer(nlp.vocab,
                     {},
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match)

Since your custom tokenizer only defines custom rules, I think all of them will be saved out with the model. So when you load it back in, they should also be available and you don’t have to add anything manually. But you might want to double-check this on an example that shows the difference, just to be sure!
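For a quick sanity check that doesn't need spaCy at all, you can exercise the raw regexes from your post directly with the standard `re` module. (The sample strings here are just made-up inputs that happen to match the patterns.)

```python
import re

# Patterns copied from the custom tokenizer above
prefix_re = re.compile(r'''1/|2/|:[0-9][0-9][A-K]:|:[0-9][0-9]:''')
infix_re = re.compile(r'''[~]''')

# The prefix pattern should match at the very start of strings like these
assert prefix_re.search("1/foo").start() == 0
assert prefix_re.search(":12A:bar").start() == 0
assert prefix_re.search(":07:baz").start() == 0

# A plain word has no prefix match at all
assert prefix_re.search("plain") is None

# ...and the infix pattern finds every ~ inside a token
assert [m.group() for m in infix_re.finditer("a~b~c")] == ["~", "~"]
```

If those assertions hold before and after a save/load roundtrip of the model, you know the custom rules survived serialization.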

(I’ll also make a fix to spaCy that resolves the underlying issue :slightly_smiling_face:)

Hi Ines,

Really sorry for the late reply – I was on vacation and unable to respond. Your solution worked. Thanks for the support! :slightly_smiling_face:


I tried saving a modified en_core_web_sm:

nlp.to_disk("en_core_web_sm_modified")

It saves it to the en_core_web_sm_modified folder, but when I generate a package and try to install it, it still has the name en_core_web_sm and would thus overwrite the existing package. I figured I can modify the official name in meta.json of the generated package.

Now I renamed it to core_web_sm_modified in meta.json and installed it with pip install .

Next, I am trying to import a pre-annotated text for NER with it:
cat my.jsonl | prodigy ner.manual projectname en_core_web_sm_modified - --loader jsonl

Here I am getting an error:

FileNotFoundError: [Errno 2] No such file or directory: '/Applications/anaconda3/envs/spacy/lib/python3.7/site-packages/en_core_web_sm_modified/en_core_web_sm_modified-2.3.1/tokenizer'

Indeed, ls /Applications/anaconda3/envs/spacy/lib/python3.7/site-packages/en_core_web_sm_modified

contains only these 3 files:
__init__.py __pycache__ meta.json

Here I realized that I had to manually rename the inner directory from ./en_core_web_sm to ./en_core_web_sm_modified.

I am not sure whether this behavior with naming of derived packages is intended or is a bug. It is a bit confusing though.

If you want to create a Python package from your model, the spacy package command should take care of creating the package directories, making sure the module name matches, etc. So after saving out the modified nlp object, you can edit the meta and then auto-generate the package: Command Line Interface · spaCy API Documentation
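As a minimal sketch of the "edit the meta" step (the directory layout, version number, and names here are just placeholders simulating what nlp.to_disk writes out):

```python
import json
import tempfile
from pathlib import Path

# Simulate a saved model directory containing a minimal meta.json
# (in practice this is the folder written by nlp.to_disk)
model_dir = Path(tempfile.mkdtemp()) / "en_core_web_sm_modified"
model_dir.mkdir()
meta_path = model_dir / "meta.json"
meta_path.write_text(json.dumps(
    {"lang": "en", "name": "core_web_sm", "version": "2.3.1"}))

# Give the model a distinct name before packaging; the full package
# name is built from meta["lang"] + "_" + meta["name"]
meta = json.loads(meta_path.read_text())
meta["name"] = "core_web_sm_modified"
meta_path.write_text(json.dumps(meta, indent=2))
```

After that, running `python -m spacy package en_core_web_sm_modified ./packages` should generate a package directory with the new name, which you can then install with pip.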

Alternatively, you can also just load your model from a file path, which is typically the easiest way during development. The spaCy model you pass into Prodigy can be anything that can be loaded with spacy.load, so either a package or a local path.
