How to modify the tokenizer used by Prodigy's recipes?

daniel · March 27, 2018, 1:30pm

When using the ner.manual or ner.make_gold recipes, the tokenizer sometimes does not give the desired results. For example, the following text:

How to play Pink Floyd- "Wish You Were Here"

Contains a typo: the "-" should have a space before it. As a result, the tokenizer creates the tokens Pink and Floyd-, so there is no way to mark the entity Pink Floyd in the Prodigy UI without including the trailing "-".

I understand there is a way to customize the tokenizer's prefixes/suffixes in code, but how can I make that work with Prodigy's recipes?

Is there a way to update the model's prefix/suffix list and save it back to the model, so that anything that uses the new model tokenizes the example above correctly without the need to create a custom tokenizer in code each time?

Thanks.

ines · March 27, 2018, 3:17pm

Yes, this is actually the main reason Prodigy recipes always take a model and don't just use the tokenizer or language classes shipped with spaCy directly. The idea is that you can create a custom base model for your own "dialect" of English (for example, music language on YouTube).

When you save out a model via nlp.to_disk, the tokenizer is also serialized. This includes the prefix/suffix/infix rules, as well as the tokenizer exceptions. The respective code is in tokenizer.pyx. So you could overwrite nlp.tokenizer.suffix_search with your own function and then save out the model again.
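For anyone finding this later, here is a minimal sketch of what that could look like. It assumes a reasonably recent spaCy where spacy.util.compile_suffix_regex and nlp.Defaults.suffixes are available. The helper names add_suffix_rules and save_custom_model, the base model en_core_web_sm and the output directory are placeholders for illustration, not anything Prodigy requires. It also assumes that the replacement for suffix_search is the search method of a compiled regular expression, since, as far as I can tell, the tokenizer serializes the regex pattern rather than an arbitrary function.

```python
def add_suffix_rules(nlp, extra_suffixes):
    """Extend the suffix rules of a loaded spaCy pipeline in place."""
    # Imported inside the function so that spaCy is only needed when it is called.
    from spacy.util import compile_suffix_regex

    # Start from the language defaults and append the extra patterns.
    suffixes = list(nlp.Defaults.suffixes) + list(extra_suffixes)
    # Use the compiled regex's bound search method as the new suffix rule.
    nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
    return nlp


def save_custom_model(base_model, output_dir, extra_suffixes=(r"-",)):
    """Load a base model, treat a trailing hyphen as a suffix, save the result."""
    import spacy

    nlp = spacy.load(base_model)  # e.g. "en_core_web_sm" (assumed to be installed)
    add_suffix_rules(nlp, extra_suffixes)
    nlp.to_disk(output_dir)  # the modified tokenizer is saved with the model
    return nlp
```

Calling save_custom_model("en_core_web_sm", "./en_music_model") should write a model directory whose tokenizer splits Floyd- into Floyd and -. That directory can then be passed to a recipe in place of the model name, along the lines of prodigy ner.manual your_dataset ./en_music_model your_data.jsonl --label BAND (dataset, file and label names are made up here).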

daniel · March 27, 2018, 7:34pm

That worked perfectly, thanks.
