How to modify the tokenizer used by Prodigy's recipes?

When using the ner.manual or ner.make_gold recipes, the tokenizer sometimes doesn't give the desired results. For example, the following text:

How to play Pink Floyd- "Wish You Were Here"

contains a typo: the "-" should have a space before it. As a result, the tokenizer creates the tokens Pink and Floyd-, so there is no way to mark the entity Pink Floyd in the Prodigy UI without the trailing "-".

I understand there is a way to customize the tokenizer's prefixes/suffixes in code, but how can I make that work with Prodigy's recipes?

Is there a way to update the model's prefix/suffix lists and save them back to the model, so that anything using the new model tokenizes the example above correctly without having to create a custom tokenizer in code each time?

Thanks.

Yes, this is actually the main reason Prodigy recipes always take a model and don't just use the tokenizer or language classes shipped with spaCy directly. The idea is that you can create a custom base model for your own "dialect" of English (e.g. music language on YouTube).

When you save out a model via nlp.to_disk, the tokenizer is also serialized. This includes the prefix/suffix/infix rules, as well as the tokenizer exceptions. See here for the respective code in tokenizer.pyx. So you could overwrite nlp.tokenizer.suffix_search with your own function and then save out the model again.
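
As a minimal sketch of that approach (not tested against your setup, and assuming a reasonably recent spaCy version): the snippet below adds a trailing "-" to the default suffix rules, replaces nlp.tokenizer.suffix_search, and saves the pipeline to disk. The blank English pipeline and the temporary directory are stand-ins for your own base model and output path. Note that this rule splits a hyphen off the end of any token, which may be broader than you want.

```python
import tempfile

import spacy
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline stands in for your base model.
# In practice you would load your own model with spacy.load(...).
nlp = spacy.blank("en")

# Extend the default suffix rules with a trailing hyphen so "Floyd-" is split.
suffixes = list(nlp.Defaults.suffixes) + [r"-"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

text = 'How to play Pink Floyd- "Wish You Were Here"'
tokens = [t.text for t in nlp(text)]

# Save the modified pipeline and load it back to check that the rule persists.
# Assumption: a temporary directory is used here; pick a real path for your model.
with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)
    reloaded_tokens = [t.text for t in reloaded(text)]

print(tokens)
print(reloaded_tokens)
```

The saved directory can then be passed to the recipe in place of the model name, so the recipe picks up the modified tokenizer.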

That worked perfectly, thanks.
