Custom Tokenizer

We have a healthcare-specific domain tokenizer and would like to use it for text classification instead of the default one.
How should I go about it?

Sure, that should be no problem! Here are two possible solutions:

  1. Save out the model you’re using, package it using the spacy package command and add your custom tokenizer to the model package. Models are regular Python packages, so you can ship any code with them and execute it within the model’s load() method. See here for more details.

  2. Use a custom recipe that loads a spaCy model and adds your custom tokenizer (instead of just calling nlp = spacy.load(spacy_model)).
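For the second option, the loading step inside the recipe could look roughly like the sketch below. The helper name load_with_custom_tokenizer is made up for illustration, and the whitespace-only Tokenizer is just a stand-in for your healthcare tokenizer; the fallback to a blank English pipeline is only there so the snippet runs without a downloaded model.

```python
import spacy
from spacy.tokenizer import Tokenizer


def load_with_custom_tokenizer(spacy_model=None):
    """Hypothetical helper: load a pipeline and swap in a custom tokenizer."""
    # Load the named model, or fall back to a blank English pipeline
    # (assumption, so the example runs without any installed model).
    nlp = spacy.load(spacy_model) if spacy_model else spacy.blank("en")
    # Stand-in for the domain tokenizer: with no prefix, suffix or infix
    # rules, this Tokenizer only splits on whitespace.
    nlp.tokenizer = Tokenizer(nlp.vocab)
    return nlp


nlp = load_with_custom_tokenizer()
print([token.text for token in nlp("x-ray of the chest.")])
```

In a recipe, you would call a helper like this in place of the plain spacy.load(spacy_model) call and pass the resulting nlp object on as usual.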

Can I do something like this below in the custom recipe as well?

nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])

Sure! You can modify the nlp object and its tokenizer however you want.
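As a quick check, here is a minimal sketch of the special case in action. It assumes a blank English pipeline instead of your actual model, and the example sentence is made up.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline (assumption: stands in for the model the recipe loads).
nlp = spacy.blank("en")

# Keep "x-ray" together as one token instead of splitting it on the hyphen.
nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])

tokens = [token.text for token in nlp("The x-ray was clear")]
print(tokens)
```

Note that special cases match the exact string, so variants like "X-ray" or "x-rays" would each need their own entry.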