How to incorporate document metadata in spaCy 3.0?

Since upgrading to spaCy 3.0 I am having trouble figuring out how to do something similar to what is described here:

Could someone show an updated example?

Thanks!

Hi! The general approach described in the thread you linked should still work the same way in spaCy v3. The difference is that you no longer need the hack of overwriting nlp.tokenizer. Instead, you can register a custom tokenizer by adding the @spacy.registry.tokenizers decorator to your function: https://spacy.io/usage/linguistic-features#custom-tokenizer-training

In the config, you can then write:

[nlp.tokenizer]
@tokenizers = "your_custom_tokenizer_name"
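
For reference, here's a rough sketch of how the pieces could fit together. I'm assuming the metadata arrives as a JSON string with "text" and "meta" keys, so adjust that to whatever your input actually looks like. The names metadata_tokenizer.v1, MetadataTokenizer and the meta extension attribute are made up for this example and don't come from the linked thread or from spaCy itself:

```python
import json

import spacy
from spacy.tokens import Doc

# Hypothetical extension attribute that holds the per-document metadata.
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)


class MetadataTokenizer:
    """Assumed input: a JSON string like {"text": "...", "meta": {...}}.
    Anything else is tokenized as plain text with no metadata."""

    def __init__(self, base_tokenizer):
        self.base = base_tokenizer

    def __call__(self, string):
        text, meta = string, None
        try:
            data = json.loads(string)
        except ValueError:
            data = None
        if isinstance(data, dict) and "text" in data:
            text, meta = data["text"], data.get("meta")
        doc = self.base(text)
        doc._.meta = meta
        return doc

    # Delegate serialization to the wrapped tokenizer so that saving and
    # loading the pipeline keeps working.
    def to_disk(self, path, **kwargs):
        self.base.to_disk(path, **kwargs)

    def from_disk(self, path, **kwargs):
        self.base.from_disk(path, **kwargs)
        return self

    def to_bytes(self, **kwargs):
        return self.base.to_bytes(**kwargs)

    def from_bytes(self, data, **kwargs):
        self.base.from_bytes(data, **kwargs)
        return self


@spacy.registry.tokenizers("metadata_tokenizer.v1")
def create_metadata_tokenizer():
    def create_tokenizer(nlp):
        # Build spaCy's default tokenizer for this language, then wrap it.
        make_default = spacy.registry.tokenizers.get("spacy.Tokenizer.v1")()
        return MetadataTokenizer(make_default(nlp))

    return create_tokenizer


# Same idea as the [nlp.tokenizer] config block, passed as a dict and
# using the hypothetical name registered above.
nlp = spacy.blank(
    "en",
    config={"nlp": {"tokenizer": {"@tokenizers": "metadata_tokenizer.v1"}}},
)

doc = nlp(json.dumps({"text": "Hello world!", "meta": {"source": "forum"}}))
print(doc.text, doc._.meta)
```

The tokenizer falls back to plain text when the input isn't JSON, so the pipeline should still accept ordinary strings. Whether you want that fallback or a hard error depends on your data.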

You also don't have to manually edit the __init__.py in your package anymore after running spacy package. Instead, you can use the --code argument on the CLI and point it to the Python file containing your custom functions. That file will then be packaged with the pipeline automatically: https://spacy.io/api/cli#package