I am running a fairly standard span annotation task using the spans.manual recipe. However, my data requires a very specific tokenization process, which is already implemented in dedicated Python code.
The documentation states that I can pass a "Loadable spaCy pipeline for tokenization" when running the annotation task; however, my tokenizer is not implemented as a spaCy pipeline, and I would like to use my existing tokenization code as-is.
I have tried applying my tokenizer to the input data, re-joining the tokens with single spaces, and running Prodigy with "blank:en", hoping that the blank model would perform only whitespace tokenization. However, it seems that the blank model does more than that (for example, certain punctuation marks are split off as separate tokens), which effectively splits some of the tokens in my data.
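To make the attempt concrete, here is a rough sketch of the preprocessing I am doing. The names my_tokenize and build_records are placeholders for illustration only (my real tokenizer is more involved), and the dataset name, file name and label in the command comment are made up.

```python
import json


def my_tokenize(text):
    # Placeholder for my real tokenizer (hypothetical): here it only splits
    # on whitespace, whereas the real one applies domain-specific rules.
    return text.split()


def build_records(texts):
    # Tokenize each text with my own code, then re-join the tokens with single
    # spaces so that whitespace boundaries match my token boundaries.
    records = []
    for text in texts:
        tokens = my_tokenize(text)
        records.append(json.dumps({"text": " ".join(tokens)}))
    return records


if __name__ == "__main__":
    # One JSON object per line (JSONL), used as the input source for the recipe.
    for line in build_records(["Dose was 5mg/kg  (i.v.) daily."]):
        print(line)
    # Then, roughly (names are made up):
    #   prodigy spans.manual my_dataset blank:en ./data.jsonl --label DOSE
```

With this input, tokens such as "(i.v.)" that my tokenizer keeps whole are the kind that end up split again in the annotation interface.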
Can someone please advise on the best way to use an existing non-spaCy tokenizer with a built-in recipe?