So I followed the description in the above post and actually got it to work like this:
1. Write custom tokenizer
I created a file utils/my_tokenizer.py which looks like this:
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

def my_tokenizer(nlp):
    # Extend the default infix patterns with the custom split rules
    infixes = nlp.Defaults.infixes + [
        r",",                    # Always split on commas
        r"(?<=\d)(?=[A-Za-z])",  # Split between digits and letters
        r"(?<=[a-z])(?=[A-Z])",  # Split where lowercase precedes uppercase
        r"(?<=[A-Za-z])(?=\d)",  # Split where letters precede digits
        r"/",                    # Split at every slash
    ]
    # infixes already contains the defaults, so don't add them a second time here
    infix_re = compile_infix_regex(infixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)
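To sanity-check the infix rules before wiring anything into Prodigy, a quick round trip like this helps (the sample string is just an illustration I made up):

import spacy
from utils.my_tokenizer import my_tokenizer

nlp = spacy.blank("de")
nlp.tokenizer = my_tokenizer(nlp)

# Covers letter/digit boundaries, a slash, a comma and a case change
print([t.text for t in nlp("AB12/34cd,EfGh")])
# Expected output along the lines of: ['AB', '12', '/', '34', 'cd', ',', 'Ef', 'Gh']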
2. Prepare custom tokenizer pipeline for Prodigy
In order to use the custom tokenizer within Prodigy, I wrote the tokenizer into a pipeline using a script named write_tokenizer_pipeline.py:
import spacy
from utils.my_tokenizer import my_tokenizer

# Load a blank German pipeline
nlp = spacy.blank("de")
# Replace the default tokenizer with the customized one
nlp.tokenizer = my_tokenizer(nlp)
# Serialize the pipeline, including the tokenizer settings, to disk
nlp.to_disk("./tokenizer/my_tokenizer")
Then I just ran python scripts/python/write_tokenizer_pipeline.py, which writes the custom tokenizer pipeline to the tokenizer/my_tokenizer directory.
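Because the custom rules live in spaCy's built-in Tokenizer class rather than a custom subclass, they are serialized with the pipeline and survive a reload, which you can verify like this:

import spacy

# Reload the serialized pipeline and confirm the custom splits are preserved
nlp = spacy.load("./tokenizer/my_tokenizer")
print([t.text for t in nlp("12/34")])  # ['12', '/', '34']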
3. Start Prodigy and do the labeling
I started Prodigy like this and got to labeling:
prodigy ner.manual dataset_name ./tokenizer/my_tokenizer/ corpus/my_examples.jsonl --label LABEL1,LABEL2,LABEL3
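For reference, corpus/my_examples.jsonl is in Prodigy's usual JSONL input format, one object with a "text" key per line; the two lines below are made-up examples, not my actual data:

{"text": "Artikel 12AB/34 wurde geliefert"}
{"text": "Bestellung 7x Schrauben M8/40"}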
4. Write a spaCy config from the custom tokenizer
To prepare for training, I then generated a spaCy config from the custom tokenizer pipeline:
prodigy spacy-config custom_tok.cfg --ner dataset_name --base-model ./tokenizer/my_tokenizer/
The above post states that you need to change some things in the config after doing this, but for me, using Prodigy v1.17.0 (2024-11-18), no changes were needed.
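For the curious: what makes the custom tokenizer stick in the generated config is an [initialize] section that copies it from the base model, roughly along these lines (an illustrative excerpt, not a dump of my file; exact contents can differ between versions):

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "./tokenizer/my_tokenizer"
vocab = "./tokenizer/my_tokenizer"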
5. Export datasets for spaCy
Since spaCy needs its binary format for training, I exported the examples like this:
prodigy data-to-spacy custom_tok_output --ner dataset_name --config custom_tok.cfg --base-model ./tokenizer/my_tokenizer
This creates a new directory named custom_tok_output with a spaCy config and the dev and train datasets.
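Before training, it doesn't hurt to validate the export with spaCy's built-in check (the paths are the ones produced by the step above):

spacy debug data custom_tok_output/config.cfg --paths.train custom_tok_output/train.spacy --paths.dev custom_tok_output/dev.spacy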
6. Train a model with spaCy
After this I could train a new model like this:
spacy train custom_tok_output/config.cfg --paths.train custom_tok_output/train.spacy --paths.dev custom_tok_output/dev.spacy
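Note that spacy train only saves the trained pipeline if you also pass --output (e.g. --output custom_tok_output/model; the path is my choice here, not something the export creates). After training, you can confirm the custom tokenizer made it into the final model:

import spacy

# model-best is written by spacy train inside the --output directory
nlp = spacy.load("custom_tok_output/model/model-best")
print([t.text for t in nlp("AB12/34cd")])
# The custom splits should still apply: ['AB', '12', '/', '34', 'cd']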