Using a custom tokenizer while annotating with a built-in recipe (spans.manual)

Hi,

I am running a pretty standard span annotation task using the spans.manual recipe. However, my data requires a very specific tokenization process, which is already implemented in dedicated Python code.

The documentation states that I can pass a "Loadable spaCy pipeline for tokenization" when running the annotation task; however, my tokenizer is not implemented as a spaCy pipeline, and I would like to use my existing tokenization code as-is.

I have tried applying my tokenizer to the input data, re-joining the tokens with single spaces, and running Prodigy with "blank:en", hoping that the blank model would perform plain white-space tokenization. However, the blank model does more than white-space tokenization (for example, certain punctuation marks are split off into their own tokens), effectively splitting some of the tokens in my data.
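To illustrate what I mean, here is a minimal snippet (the sample text is made up; the exact splits may vary by spaCy version):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("already tokenized (by my own code)")
print([t.text for t in doc])
# ['already', 'tokenized', '(', 'by', 'my', 'own', 'code', ')']
# The parentheses become separate tokens, so my original tokens are split.
```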

Can someone please advise on the best way to use an already implemented non-spaCy tokenizer with a built-in recipe?

Thanks

Hi and welcome! :wave:

I think in your case, the simplest solution would be to feed in text that is already pre-tokenized with your existing tokenizer. If "tokens" are available in the input data, Prodigy will respect them and render the text accordingly. You can see an example of the JSON format in the docs, which is pretty straightforward: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP
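Roughly, a pre-tokenized task looks like this (shown here as a Python dict; each task would be one JSON object per line in a .jsonl source file):

```python
# A single pre-tokenized task: "start" and "end" are character offsets
# into "text", and "id" is the token's position.
task = {
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1},
    ],
}
```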

Aside from the character offsets, the "ws" property lets you indicate whether a token is followed by whitespace. (You can also mark tokens as "disabled", which can be helpful if you know in advance that certain tokens can never be part of a span; this prevents them from being selectable and can improve data quality.)
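As a sketch of how you might wire this up, the script below converts tokenizer output into a .jsonl source with "ws" computed from the offsets. The `my_tokenize` function here is just a whitespace-splitting stand-in for your existing tokenizer; it only needs to yield `(token_text, start_char, end_char)` triples over the original text:

```python
import json
import re

def my_tokenize(text):
    # Stand-in for your existing tokenizer: replace this with your own code.
    # It must yield (token_text, start_char, end_char) over the original text.
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

def make_task(text):
    spans = list(my_tokenize(text))
    tokens = []
    for i, (tok, start, end) in enumerate(spans):
        # "ws" is True if the token is followed by whitespace in the text,
        # i.e. if there's a gap before the next token (or trailing space).
        next_start = spans[i + 1][1] if i + 1 < len(spans) else len(text)
        tokens.append({
            "text": tok,
            "start": start,
            "end": end,
            "id": i,
            "ws": end < next_start,
            # Optionally add "disabled": True for tokens that can
            # never be part of a span.
        })
    return {"text": text, "tokens": tokens}

with open("tasks.jsonl", "w", encoding="utf8") as f:
    for text in ["Hello Apple"]:  # your corpus here
        f.write(json.dumps(make_task(text)) + "\n")
```

You can then point spans.manual at the resulting tasks.jsonl as the source, and the pre-computed tokens will be used as-is instead of re-tokenizing the text.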

Hi,

Thank you for your reply (and for the warm welcome :)). I will follow your suggestion.