Before diving deeper into this question, I just want to make sure I understand what your goal is. If you're trying to train a BERT model, you can also use spaCy without having to resort to this custom recipe. To quote the docs:
> New in Prodigy v1.11 and spaCy v3
>
> spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use `data-to-spacy` to export your annotations and train with spaCy v3 and a transformer-based config directly, or run `train` and provide the config via the `--config` argument.
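For reference, here's a minimal sketch of that workflow. The dataset name `my_ner_dataset` and the file paths are hypothetical placeholders, so adjust them to your setup:

```bash
# Export the Prodigy annotations to spaCy's binary training format.
# This writes train/dev .spacy files plus a generated config into ./corpus.
prodigy data-to-spacy ./corpus --ner my_ner_dataset --lang en

# Train directly with spaCy v3, pointing it at a transformer-based config
# (e.g. one generated with `python -m spacy init config --gpu`).
python -m spacy train ./corpus/config.cfg \
  --output ./output \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy

# Or train via Prodigy's wrapper and pass the custom config with --config.
prodigy train ./output --ner my_ner_dataset --config ./transformer_config.cfg
```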
So just to check: are you trying to train a BERT model using spaCy? If so, you might just want to follow the steps that I describe here. If you're trying to generate data for another library, like Hugging Face, which depends on the sentencepiece tokeniser ... then I can dive a bit deeper.