I created and saved a blank en model with a custom tokenizer - it's a bit of a hack in an attempt to create a character model.
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')

# "." matches any single character, so using it as the only infix pattern
# makes the tokenizer split on every character.
infixes = (r".",)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    suffix_search=None,
    prefix_search=None,
    infix_finditer=infix_re.finditer,
    token_match=None,
)
nlp.add_pipe('ner')
nlp.begin_training()
This works; I can save and load it fine and it maintains the tokenization:
print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [N, A, E, -, 0, 3, /, N, 2, -, 1, ., V, A, V, -, 1, 5, 6, ., A, C, _, 8, -, 6, S, T, A, -, D, U, R]
nlp.to_disk("models/char-blank")
nlp = spacy.load("models/char-blank")
print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [N, A, E, -, 0, 3, /, N, 2, -, 1, ., V, A, V, -, 1, 5, 6, ., A, C, _, 8, -, 6, S, T, A, -, D, U, R]
I've used ner.manual to create a dataset (command below) and then trained with train:
prodigy train models/model --ner dataset
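For reference, the annotation step looked roughly like this, pointing ner.manual at the char-blank model so the character tokenization is used while labelling (the source file and label name here are placeholders, not my real ones):

prodigy ner.manual dataset models/char-blank corpus.jsonl --label POINT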
The problem is that when I load the trained model, the tokenizer seems to have changed, i.e.
import spacy
nlp = spacy.load("models/model/model-best")
print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])
> [NAE-03, /, N2, -, 1.VAV-156.AC_8, -, 6, STA, -, DUR]
It seems to be using the default spaCy English tokenizer, which fails on my corpus since it's technical IoT data.
I can clearly see that the tokenizer file in models/model/model-best is different from the one in my model.
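In code, the mismatch shows up like something along these lines (just a sanity check, using the same paths as above):

import spacy

char_nlp = spacy.load("models/char-blank")
trained_nlp = spacy.load("models/model/model-best")

# The serialized tokenizer settings no longer match.
print(char_nlp.tokenizer.to_bytes() == trained_nlp.tokenizer.to_bytes())

# And the two pipelines tokenize the same string differently.
text = "NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR"
print([t.text for t in char_nlp(text)])
print([t.text for t in trained_nlp(text)])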
I also tried

prodigy train models/model --ner dataset --base-model models/char-blank

but the same issue occurred: the tokenizer was changed.
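As a possible stopgap I could presumably rebuild the same tokenizer on the trained pipeline after loading it (sketched below, not properly tested), but I suspect the ner component has already been trained against the default tokenization, so I'd much rather the saved model kept my tokenizer in the first place.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load("models/model/model-best")

# Re-apply the per-character tokenizer on the trained pipeline's vocab.
infix_re = compile_infix_regex((r".",))
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    suffix_search=None,
    prefix_search=None,
    infix_finditer=infix_re.finditer,
    token_match=None,
)

print([t for t in nlp("NAE-03/N2-1.VAV-156.AC_8-6 STA-DUR")])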