How to incorporate document metadata in spaCy 3.0?

jack-rory-staunton · March 9, 2021, 1:41am

Since upgrading to spaCy 3.0 I am having trouble figuring out how to do something similar to what is described here:

Could someone show an updated example?

Thanks!

ines · March 9, 2021, 2:54am

Hi! The general approach described in the thread you linked should still work the same in spaCy v3. The difference is that you no longer need the hack of overwriting nlp.tokenizer: you can register a custom tokenizer by adding the @spacy.registry.tokenizers decorator to the function that creates it: https://spacy.io/usage/linguistic-features#custom-tokenizer-training

In the config, you can then write:

[nlp.tokenizer]
@tokenizers = "your_custom_tokenizer_name"
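For illustration, here is a minimal sketch of how the registered tokenizer and the config entry could fit together for the metadata use case. Everything specific in it is an assumption rather than something from the linked thread: the name "metadata_tokenizer", the MetadataTokenizer class, the doc._.meta extension attribute, and the JSON input shape {"text": ..., "meta": ...} are all hypothetical. It also wraps a bare Tokenizer(nlp.vocab), which only splits on whitespace, so you'd likely want to swap in the tokenizer your pipeline actually uses.

```python
import json

import spacy
from spacy.tokenizer import Tokenizer
from spacy.tokens import Doc

# Hypothetical extension attribute that holds the document-level metadata.
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)


class MetadataTokenizer:
    """Wraps another tokenizer and accepts JSON strings of the assumed
    form {"text": "...", "meta": {...}}. Plain strings pass through."""

    def __init__(self, base_tokenizer):
        self.base_tokenizer = base_tokenizer
        self.vocab = base_tokenizer.vocab

    def __call__(self, string):
        try:
            data = json.loads(string)
        except ValueError:
            data = None
        # Anything that isn't a JSON object with a "text" key is treated as raw text.
        if not isinstance(data, dict) or "text" not in data:
            data = {"text": string, "meta": {}}
        doc = self.base_tokenizer(data["text"])
        doc._.meta = data.get("meta", {})
        return doc


# The registered function returns a callable that receives the nlp object
# and returns the tokenizer. The name here is what the config refers to.
@spacy.registry.tokenizers("metadata_tokenizer")
def create_metadata_tokenizer():
    def create_tokenizer(nlp):
        # Bare tokenizer without language-specific rules (whitespace splitting only).
        return MetadataTokenizer(Tokenizer(nlp.vocab))

    return create_tokenizer


# Equivalent of the [nlp.tokenizer] block in the config file, passed as a dict.
config = {"nlp": {"tokenizer": {"@tokenizers": "metadata_tokenizer"}}}
nlp = spacy.blank("en", config=config)

doc = nlp(json.dumps({"text": "Hello world", "meta": {"source": "forum"}}))
print([token.text for token in doc], doc._.meta)
```

Downstream components can then read doc._.meta. Note that this sketch doesn't implement to_disk / from_disk or to_bytes / from_bytes on the wrapper, which you'd probably need to add before saving or packaging the pipeline.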

You also don't have to manually edit the __init__.py of your package anymore after running spacy package. Instead, you can use the --code argument on the CLI and point it to the Python file containing your custom functions. That file will then be packaged with the pipeline automatically. https://spacy.io/api/cli#package
