We have a healthcare-specific domain tokenizer and would like to use it for text classification instead of the default one.
How should I go about it?
Sure, that should be no problem! Here are two possible solutions:
- Save out the model you’re using, package it using the `spacy package` command and add your custom tokenizer to the model package’s `__init__.py`. Models are regular Python packages, so you can ship any code with them and execute it within the model’s `load()` method. See here for more details.
- Use a custom recipe that loads a spaCy model and adds your custom tokenizer (instead of just calling `nlp = spacy.load(spacy_model)`).
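For the custom-recipe route, here is a minimal sketch of what "loading a model and adding your tokenizer" could look like. The regex, `domain_split`, `DomainTokenizer` and `load_nlp` are all hypothetical names standing in for your healthcare tokenizer; the only spaCy behaviour assumed is that `nlp.tokenizer` can be any callable that takes a text and returns a `Doc`.

```python
import re

# Hypothetical stand-in for the healthcare tokenizer: keeps hyphenated
# terms such as "x-ray" together and splits off other punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def domain_split(text):
    """Return the list of token strings for `text`."""
    return TOKEN_PATTERN.findall(text)


class DomainTokenizer:
    """Callable that turns raw text into a spaCy Doc."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Imported lazily so domain_split can be tried without spaCy installed.
        from spacy.tokens import Doc

        # Without a `spaces` argument, every token is assumed to be
        # followed by a space, so the original whitespace is not preserved.
        return Doc(self.vocab, words=domain_split(text))


def load_nlp(spacy_model):
    """Load a model and swap in the custom tokenizer."""
    import spacy

    nlp = spacy.load(spacy_model)
    nlp.tokenizer = DomainTokenizer(nlp.vocab)
    return nlp
```

In the recipe you would then call `load_nlp(spacy_model)` where it previously called `spacy.load(spacy_model)`.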
Can I do something like this below in the custom recipe as well?
nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])
Sure! You can modify the `nlp` object and its tokenizer however you want.
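If you have more than one term, a small helper along these lines might be convenient. It is only a sketch: `register_terms` and `DOMAIN_TERMS` are made-up names, and "t-cell" is just an illustrative second entry. The call itself is the same `add_special_case` as in the question, with `ORTH` imported from `spacy.symbols`.

```python
# Hypothetical list of terms that should each stay a single token.
DOMAIN_TERMS = ["x-ray", "t-cell"]


def register_terms(nlp, terms):
    """Add each term to nlp.tokenizer as a one-token special case."""
    # Imported lazily; requires spaCy to be installed.
    from spacy.symbols import ORTH

    for term in terms:
        # The ORTH values of a rule must join back to the exact term.
        nlp.tokenizer.add_special_case(term, [{ORTH: term}])
    return nlp


# Usage inside the recipe (needs spaCy and a model):
# nlp = register_terms(spacy.load(spacy_model), DOMAIN_TERMS)
```

Special cases are matched on the exact string, so differently cased variants such as "X-ray" would need their own entries.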