Deberta custom tokens are all joined (no spaces).

I'm trying to use my own tokens. I'm doing NER (token classification) and I want to annotate on the tokens produced by the model I will later use for NER.

Hence, I build my own list of tokens as dictionaries, where each one contains "start", "end", "text", "id" and "ws", i.e. start index, end index, displayed text, token id and whitespace (true/false).

Now, I am using DeBERTa, whose RoBERTa-style tokeniser includes the whitespace in front of tokens. So, for instance, instead of ["Hello", "there"] it produces ["Hello", " there"], with a leading space in front of "there".

Hence, I set "ws": False everywhere, since the tokens already include the spaces. However, when they get displayed they are all joined together, i.e. "Hellothere". And when I set "ws" to True, I get both the space added by spaCy and the one from the token, i.e. "Hello  there", 2 spaces instead of 1.
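To illustrate, the token dicts I send to Prodigy look roughly like this (a simplified sketch, the offsets are just for illustration):

```python
# Simplified example of the custom token dicts (DeBERTa/RoBERTa-style,
# leading space kept in the token text, "ws" set to False everywhere)
custom_tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0, "ws": False},
    {"text": " there", "start": 5, "end": 11, "id": 1, "ws": False},
]
```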

FYI, when I do


```python
from spacy.tokens import Doc

# `tokens` here is the list of token texts (including their leading spaces)
spaces = [False] * len(tokens)
doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
```

the resulting doc displays as it should, so I suppose the problem is somewhere in the front-end.

This display error (or behaviour) occurs with both Prodigy versions I tried, 1.15.6 and 1.15.8.

How can I solve this?

Hi @Fi_Vero ,

The reason the whitespaces you define implicitly via the " " character are not rendered is that they are ignored by HTML.
The only way to define whitespace between tokens is via the "ws" attribute.
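If you do want to keep your custom tokens, one option is to strip the leading space from each token's text and encode it via "ws" on the previous token instead. A rough sketch (untested, and `deberta_tokens` is just a placeholder for your current token dicts):

```python
def to_prodigy_tokens(deberta_tokens):
    # Convert tokens whose text carries a leading space (RoBERTa/DeBERTa style)
    # into tokens where the space is recorded as "ws" on the *previous* token.
    prodigy_tokens = []
    for tok in deberta_tokens:
        text = tok["text"]
        has_leading_space = text.startswith(" ")
        if has_leading_space and prodigy_tokens:
            prodigy_tokens[-1]["ws"] = True
        prodigy_tokens.append({
            "text": text[1:] if has_leading_space else text,
            "start": tok["start"] + (1 if has_leading_space else 0),
            "end": tok["end"],
            "id": tok["id"],
            "ws": False,  # flipped to True if the next token starts with a space
        })
    return prodigy_tokens
```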
That said, to train with the DeBERTa tokenizer I would recommend a different approach rather than defining custom tokens: annotate with Prodigy using the default spaCy linguistic tokenization, export the data with data-to-spacy using a transformer-based config, and use this data to train a spaCy transformer pipeline (see the sketch below).
spaCy will take care of aligning the linguistic tokens (and span offsets) to the tokens produced by DeBERTa. One important advantage of this approach is that it will be easier to swap out the transformer if you want to try out different ones (which may also use different tokenizers).
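Roughly, that workflow could look like this (dataset and path names are placeholders; please check prodigy data-to-spacy --help and the spaCy docs for the exact flags in your versions):

```
# export annotations from your Prodigy dataset to spaCy's binary format
prodigy data-to-spacy ./corpus --ner your_ner_dataset --lang en

# create a transformer-based training config (--gpu selects a transformer pipeline;
# you can then point the transformer component at a DeBERTa checkpoint in config.cfg)
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu

# train on the exported data
python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```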
You can find some more info on annotating for transformers here: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP