How to keep newlines when using bert.ner.manual?

linb · March 24, 2022, 7:51pm

I ran bert.ner.manual with the following command:

prodigy bert.ner.manual ner_reddit ./small.jsonl --label PLATFORM-AWS,PLATFORM-AZURE --tokenizer-vocab ./bert-base-uncased-vocab.txt --lowercase --hide-wp-prefix --hide-special -F transformers_tokenizers.py

In the UI, the newlines seem to be ignored.

How can I keep the newlines in this case? I expected the UI to show the text with its line breaks preserved.

ljvmiranda921 · March 28, 2022, 2:22am

Hi @linb, my hunch is that the BERT tokenizer (BertWordPieceTokenizer) from HF ignores the newlines in the text, which is why they're not showing up in Prodigy. My suggestion is to create a custom recipe and substitute that tokenizer with spaCy's default one. Keep in mind that the tokens produced by spaCy's tokenizer may not align with your transformer model's tokenization.

The above procedure might cause destructive tokenization, which can make it harder to recover the original data from the tokenized output. If you want non-destructive tokenization, you can train a transformer-based pipeline with spaCy v3, which takes care of aligning the transformer tokenization out of the box.
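
To make "non-destructive" a bit more concrete, here is a minimal sketch in plain Python. It is not the BERT tokenizer and not spaCy's tokenizer: `tokenize_keep_newlines` and `rebuild_text` are hypothetical helpers made up for this illustration. The sketch keeps each newline as its own token and records character offsets, using the `text`/`start`/`end`/`id`/`ws` token fields that Prodigy's pre-tokenized tasks use. It assumes words are separated by single spaces.

```python
import re

# Hypothetical pattern for illustration only: a newline, a run of word
# characters, or a single punctuation character. Other whitespace is skipped.
TOKEN_RE = re.compile(r"\n|\w+|[^\w\s]")


def tokenize_keep_newlines(text):
    """Split text into tokens, keeping every newline as a separate token."""
    tokens = []
    for i, match in enumerate(TOKEN_RE.finditer(text)):
        end = match.end()
        tokens.append({
            "text": match.group(),
            "start": match.start(),
            "end": end,
            "id": i,
            # True if the token is directly followed by a space
            "ws": text[end:end + 1] == " ",
        })
    return tokens


def rebuild_text(tokens):
    """Recover the original text from the tokens (single-space assumption)."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


text = "Deploy on AWS\nor on Azure."
task = {"text": text, "tokens": tokenize_keep_newlines(text)}

# Non-destructive: the newline survives and the text can be rebuilt exactly.
assert rebuild_text(task["tokens"]) == text
```

Because the `"\n"` token is still present in `task["tokens"]`, an interface that reads the token list has something to render as a line break. A tokenizer that discards the newline during normalization leaves nothing to render, which matches what you're seeing.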
