Custom Tokenizer

madhujahagirdar · February 8, 2018, 5:03am

We have a healthcare-specific tokenizer and would like to use it for text classification instead of the default one.
How should I go about it?

ines · February 8, 2018, 5:17am

Sure, that should be no problem! Here are two possible solutions:

1. Save out the model you’re using, package it using the spacy package command and add your custom tokenizer to the model package’s __init__.py. Models are regular Python packages, so you can ship any code with them and execute it within the model’s load() method. See here for more details.
2. Use a custom recipe that loads a spaCy model and adds your custom tokenizer (instead of just calling nlp = spacy.load(spacy_model)).
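
As a rough sketch of the second option: spaCy lets you replace nlp.tokenizer with any callable that takes a text and returns a Doc. In the snippet below, healthcare_tokenize, HealthcareTokenizer and load_nlp are hypothetical names, and the regex is only a stand-in for your own domain tokenizer. It also assumes tokens are separated by at most one whitespace character; otherwise doc.text will not reproduce the input exactly.

```python
import re
from typing import List, Tuple

# Hypothetical stand-in for a domain tokenizer: keeps hyphenated terms such
# as "x-ray" together and splits punctuation into separate tokens.
_TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def healthcare_tokenize(text: str) -> Tuple[List[str], List[bool]]:
    """Return the tokens and, for each token, whether whitespace follows it."""
    words, spaces = [], []
    for match in _TOKEN_RE.finditer(text):
        words.append(match.group())
        # Slicing avoids an IndexError at the end of the string.
        spaces.append(text[match.end():match.end() + 1].isspace())
    return words, spaces


class HealthcareTokenizer:
    """Adapter that exposes the tokenizer above as a spaCy tokenizer."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text: str):
        # Imported here so the pure-Python tokenizer works without spaCy.
        from spacy.tokens import Doc

        words, spaces = healthcare_tokenize(text)
        return Doc(self.vocab, words=words, spaces=spaces)


def load_nlp(spacy_model: str):
    """Load a spaCy model and swap in the custom tokenizer."""
    import spacy

    nlp = spacy.load(spacy_model)
    nlp.tokenizer = HealthcareTokenizer(nlp.vocab)
    return nlp
```

In a custom recipe, calling load_nlp(spacy_model) would take the place of the plain spacy.load(spacy_model) call.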

madhujahagirdar · February 8, 2018, 12:45pm

Can I do something like this below in the custom recipe as well?

nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])

ines · February 8, 2018, 1:21pm

Sure! You can modify the nlp object and its tokenizer however you want.
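
If there are many such terms, one way to organize this is a small helper that registers each of them as a special case. This is only a sketch: add_domain_terms is a hypothetical name, and it uses the string key "ORTH", which spaCy accepts in place of the ORTH symbol from spacy.symbols.

```python
from typing import Iterable


def add_domain_terms(nlp, terms: Iterable[str]):
    """Register each term as a tokenizer special case so it stays one token."""
    for term in terms:
        # Each special case maps the exact string to a single token
        # whose text is the term itself.
        nlp.tokenizer.add_special_case(term, [{"ORTH": term}])
    return nlp
```

In the recipe, this would be called right after loading the model, for example add_domain_terms(nlp, ["x-ray"]).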
