Custom Tokenizer


(Madhu Jahagirdar) #1

We have a healthcare-specific tokenizer and would like to use it for text classification instead of the default one.
How should I go about it?


(Ines Montani) #2

Sure, that should be no problem! Here are two possible solutions:

  1. Save out the model you’re using, package it using the spacy package command and add your custom tokenizer to the model package’s __init__.py. Models are regular Python packages, so you can ship any code with them and execute it within the model’s load() method. See here for more details.

  2. Use a custom recipe that loads a spaCy model and adds your custom tokenizer (instead of just calling nlp = spacy.load(spacy_model)).
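
As a rough sketch of the loading step in option 2: the snippet below swaps the tokenizer on an nlp object before it is used. The WhitespaceTokenizer class and the load_with_custom_tokenizer helper are hypothetical stand-ins for your own healthcare tokenizer and recipe code, and spacy.blank("en") is only used so the example runs without a downloaded model. The recipe wiring itself is not shown.

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical stand-in for a domain-specific tokenizer.

    Any callable that takes a text and returns a Doc can be
    assigned to nlp.tokenizer.
    """

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on whitespace only; replace with your own logic.
        words = text.split()
        return Doc(self.vocab, words=words)


def load_with_custom_tokenizer(spacy_model=None):
    """Load a pipeline (or a blank English one) and swap its tokenizer."""
    nlp = spacy.load(spacy_model) if spacy_model else spacy.blank("en")
    nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
    return nlp


nlp = load_with_custom_tokenizer()
doc = nlp("Chest x-ray shows no acute findings")
tokens = [token.text for token in doc]
print(tokens)
```

Inside a custom recipe, you would call something like load_with_custom_tokenizer(spacy_model) in place of the plain spacy.load(spacy_model) call and then use the returned nlp object as usual.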


(Madhu Jahagirdar) #3

Can I do something like this below in the custom recipe as well?

from spacy.symbols import ORTH
nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])


(Ines Montani) #4

Sure! You can modify the nlp object and its tokenizer however you want.
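
For illustration, here is a minimal sketch of adding a special case to the default tokenizer. It assumes a blank English pipeline so that it runs without a downloaded model; in a recipe you would apply the same call to the nlp object you loaded.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline stands in for your loaded model.
nlp = spacy.blank("en")

# Tell the tokenizer to always keep "x-ray" together as a single token.
nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])

doc = nlp("The x-ray was normal")
special_tokens = [token.text for token in doc]
print(special_tokens)
```

Special cases match the exact string, so variants such as "X-ray" or "x-rays" would each need their own entry.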