Welcome to the forum @kaiser
Sourcing the custom tokenizer when passing `--base_model` is currently not automated, so you'd need to configure it via the `config.cfg` file.
If you already have your modified base model as a package, you could try adding the following to the config file:

```ini
# Inside your .cfg file
...
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "your_base_model"
vocab = "your_base_model"
...
```
Alternatively, you can:
- Define your modification as a registered callback:

```python
# functions.py
from spacy.util import registry, compile_infix_regex
from spacy.lang.char_classes import ALPHA, HYPHENS

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # Add an infix rule that splits on a hyphen between two letters
        infix_rules = nlp.Defaults.infixes + [
            r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
        ]
        infix_re = compile_infix_regex(infix_rules)
        nlp.tokenizer.infix_finditer = infix_re.finditer
    return customize_tokenizer
```
- Reference this callback in the config file so that it runs before pipeline initialization:

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```
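As a quick sanity check before training (a minimal sketch, not part of the training setup), you can apply the same customization to a blank English pipeline and inspect the tokenization directly:

```python
import spacy
from spacy.util import compile_infix_regex
from spacy.lang.char_classes import ALPHA, HYPHENS

# Blank English pipeline stands in for your base model here
nlp = spacy.blank("en")

# Same infix rule as in the callback: split on a hyphen between two letters
infix_rules = nlp.Defaults.infixes + [
    r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)
]
nlp.tokenizer.infix_finditer = compile_infix_regex(infix_rules).finditer

tokens = [t.text for t in nlp("a non-deterministic result")]
print(tokens)
```

If the rule is picked up, the hyphen appears as its own token in the output.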
See the spaCy docs for details: Training Pipelines & Models · spaCy Usage Documentation
Then you'd run `train` like so:

```shell
python -m prodigy train ./output -n ner_dataset --config config.cfg -F functions.py
```

where `functions.py` contains the registered callback for modifying the tokenizer.
Please note that you can generate the config with `spacy init config`:
https://spacy.io/usage/training/#quickstart
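For example, a minimal NER config could be generated like this (adjust `--lang` and `--pipeline` to your project; the paths here are just placeholders):

```shell
python -m spacy init config config.cfg --lang en --pipeline ner
```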