kaiser
August 8, 2023, 7:23am
Hey, I modified the blank:xx model by overriding the infix rule:
```python
import spacy

# Initialize a blank model
nlp_de = spacy.blank("xx")

# Override the infix rules to add the hyphen as a separator for the loaded model
infix_rules = nlp_de.Defaults.infixes + [r'''[-~()=]|(?<=[a-zA-Z0-9])*^ |(?<=[a-zA-Z]):<>/ |(?<=()[^)]*(?=))''']
infix_re = spacy.util.compile_infix_regex(infix_rules)

# Set the modified infix rules on the tokenizer
nlp_de.tokenizer.infix_finditer = infix_re.finditer

# Save the modified model to disk
nlp_de.to_disk("extended_xx_model")
```
Basically like this: Creating a custom spaCy tokenizer to use with Prodigy - virtual7 GmbH - Blog
But when I use the train recipe, it uses the default tokenizer: `tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}`.
I tried to modify the config in various ways and even added the modified `infix_finditer` to the config:

```ini
[components.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"
infix_finditer = r'''[-~()=]|(?<=[a-zA-Z0-9])*^ |(?<=[a-zA-Z]):<>/ |(?<=()[^)]*(?=))'''
```
Do you have another suggestion what I might add/change?
Welcome to the forum, @kaiser!
Sourcing the custom tokenizer when passing `--base_model` is currently not automated; you'd need to modify it via the `config.cfg` file.
If you already have your modified base model as a package, you could try adding the following to the config file:
```ini
# Inside your .cfg file
...
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "your_base_model"
vocab = "your_base_model"
...
```
Alternatively, you can define your modification as a registered callback:
```python
# functions.py
from spacy.util import registry, compile_infix_regex
from spacy.lang.char_classes import ALPHA, HYPHENS


@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Add an infix rule that splits on hyphens between alphabetic characters
        infix_rules = nlp.Defaults.infixes + [
            r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
        ]
        infix_re = compile_infix_regex(infix_rules)
        nlp.tokenizer.infix_finditer = infix_re.finditer

    return customize_tokenizer
```
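For a quick sanity check outside of Prodigy, the same customization can be applied inline to a blank pipeline (the blank `xx` pipeline and the example word below are illustrative assumptions, not part of your setup):

```python
# sanity_check.py -- apply the same infix customization to a blank pipeline
# and confirm that hyphens between letters become separate tokens.
import spacy
from spacy.util import compile_infix_regex
from spacy.lang.char_classes import ALPHA, HYPHENS

nlp = spacy.blank("xx")
infix_rules = nlp.Defaults.infixes + [
    r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
]
nlp.tokenizer.infix_finditer = compile_infix_regex(infix_rules).finditer

print([t.text for t in nlp("well-known")])
```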
Then reference this callback in the config file so that it runs before pipeline initialization:
```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```
See the spaCy docs for details: Training Pipelines & Models · spaCy Usage Documentation
Then you'd run `train` like so:

```
python -m prodigy train ./output -n ner_dataset --config config.cfg -F functions.py
```

where `functions.py` contains the registered callback for modifying the tokenizer.
Please note that you can generate the config with `spacy init config`:
https://spacy.io/usage/training/#quickstart
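As a concrete starting point (the output filename and the `ner` pipeline component here are assumptions for illustration), a base config for the multi-language model could be generated with:

```shell
# Generate a base training config for the multi-language "xx" model with an
# NER component; the [initialize.before_init] callback block can then be
# added to the generated config.cfg by hand.
python -m spacy init config config.cfg --lang xx --pipeline ner
```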