Tokenization misalignment when using ner.llm.fetch and bert.ner.manual

Welcome to the forum @Fangjian :waving_hand:

Not sure if you've seen our docs on annotation for BERT-like transformers, but spaCy v3 takes care of aligning linguistic tokenization (produced by spaCy tokenizers) to BERT tokenization before training.

So unless you specifically need to annotate data that is already BERT-tokenized, you can work with spaCy's default tokenizer in your Prodigy annotation workflows (i.e. `ner.llm.fetch` and `ner.manual`). Once you're done and ready to train a transformer pipeline, you can export your data with `data-to-spacy` and use it for training with spaCy.
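To make that concrete, a rough sketch of the end-to-end workflow on the command line could look something like this. The dataset name `ner_news`, the file paths, the labels and the spacy-llm `config.cfg` are all placeholders for illustration:

```bash
# 1) Pre-annotate with an LLM and cache the suggestions to disk
#    (config.cfg is a hypothetical spacy-llm config defining the NER task)
python -m prodigy ner.llm.fetch config.cfg ./examples.jsonl ./pre_annotated.jsonl

# 2) Review and correct the suggestions using spaCy's default tokenization
python -m prodigy ner.manual ner_news blank:en ./pre_annotated.jsonl --label PERSON,ORG,PRODUCT

# 3) Export the annotations to spaCy's binary training format
python -m prodigy data-to-spacy ./corpus --ner ner_news --lang en

# 4) Train with spaCy; swap in or edit a transformer-based config here
#    if you want a transformer pipeline
python -m spacy train ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy \
  --output ./model
```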
This post walks through each step in more detail.
That is, if you're planning to train a spaCy model. Let me know if that's not the case!