Using a custom tokenizer while annotating with a built-in recipe (spans.manual)

Hi,

I am running a pretty standard span annotation task using the spans.manual recipe. However, my data requires a very specific tokenization process, which is already implemented in dedicated Python code.

The documentation states that I can pass a "Loadable spaCy pipeline for tokenization" when running the annotation task; however, my tokenizer is not implemented as a spaCy pipeline, and I would like to use my existing tokenization code as-is.

I have tried applying my tokenizer to the input data, re-joining the tokens with single spaces, and running Prodigy with "blank:en", hoping that the blank model would perform plain white-space tokenization. However, the blank model does more than white-space tokenization (for example, certain punctuation marks are split off into their own tokens), effectively splitting some of the tokens in my data.
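To illustrate what I mean, here is a minimal snippet (the sample text is made up; the exact splits may vary by spaCy version):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("already tokenized (by my own code)")
print([t.text for t in doc])
# ['already', 'tokenized', '(', 'by', 'my', 'own', 'code', ')']
# The parentheses become separate tokens, so my original tokens are split.
```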

Can someone please advise on the best way to use an already implemented non-spaCy tokenizer with a built-in recipe?

Thanks

Hi and welcome! :wave:

I think in your case, the simplest solution would be to feed in text that is already pre-tokenized with your existing tokenizer. If "tokens" are available in the input data, Prodigy will respect them and render the text accordingly. You can see an example of the JSON format in the docs, which is pretty straightforward: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP
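Roughly, a pre-tokenized task looks like this (shown here as a Python dict; each task would be one JSON object per line in a .jsonl source file):

```python
# A single pre-tokenized task: "start" and "end" are character offsets
# into "text", and "id" is the token's position.
task = {
    "text": "Hello Apple",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 0},
        {"text": "Apple", "start": 6, "end": 11, "id": 1},
    ],
}
```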

Aside from the character offsets, the "ws" property lets you indicate whether a token is followed by whitespace. (You can also mark tokens as "disabled", which can be helpful if you know in advance that certain tokens can never be part of a span; this prevents them from being selectable and can improve data quality.)
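As a sketch of how you might wire this up, the script below converts tokenizer output into a .jsonl source with "ws" computed from the offsets. The `my_tokenize` function here is just a whitespace-splitting stand-in for your existing tokenizer; it only needs to yield `(token_text, start_char, end_char)` triples over the original text:

```python
import json
import re

def my_tokenize(text):
    # Stand-in for your existing tokenizer: replace this with your own code.
    # It must yield (token_text, start_char, end_char) over the original text.
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

def make_task(text):
    spans = list(my_tokenize(text))
    tokens = []
    for i, (tok, start, end) in enumerate(spans):
        # "ws" is True if the token is followed by whitespace in the text,
        # i.e. if there's a gap before the next token (or trailing space).
        next_start = spans[i + 1][1] if i + 1 < len(spans) else len(text)
        tokens.append({
            "text": tok,
            "start": start,
            "end": end,
            "id": i,
            "ws": end < next_start,
            # Optionally add "disabled": True for tokens that can
            # never be part of a span.
        })
    return {"text": text, "tokens": tokens}

with open("tasks.jsonl", "w", encoding="utf8") as f:
    for text in ["Hello Apple"]:  # your corpus here
        f.write(json.dumps(make_task(text)) + "\n")
```

You can then point spans.manual at the resulting tasks.jsonl as the source, and the pre-computed tokens will be used as-is instead of re-tokenizing the text.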

Hi,

Thank you for your reply (and for the warm welcome :)). I will follow your suggestion.