Deberta custom tokens are all joined (no spaces).

I'm trying to use my own tokens. I'm doing NER (token classification) and I want to annotate on the tokens produced by the model I will later use for NER.

Hence, I build my own list of tokens as dictionaries, where each one contains "start", "end", "text", "id" and "ws", i.e. start index, end index, displayed text, token id and whitespace (true/false).

Now, I am using DeBERTa, whose RoBERTa-style tokeniser includes the whitespace in front of tokens. So, for instance, instead of ["Hello", "there"] it produces ["Hello", " there"], with a leading space in front of "there".

Hence, I set "ws": False everywhere, since the tokens already include the spaces. However, when they get displayed they are all joined together, i.e. "Hellothere". And when I set "ws" to True, I get both the space added by spaCy and the one from the token, i.e. "Hello  there", 2 spaces instead of 1.
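To illustrate, the token dicts I send to Prodigy look roughly like this (a simplified sketch, the offsets are just for illustration):

```python
# Simplified example of the custom token dicts (DeBERTa/RoBERTa-style,
# leading space kept in the token text, "ws" set to False everywhere)
custom_tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0, "ws": False},
    {"text": " there", "start": 5, "end": 11, "id": 1, "ws": False},
]
```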

FYI, when I do


```python
from spacy.tokens import Doc

# `tokens` here is the list of token texts (including their leading spaces)
spaces = [False] * len(tokens)
doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
```

the resulting doc displays as it should, so I suppose the problem is somewhere in the front-end.

This display error (or behaviour) occurs with both Prodigy versions I tried, 1.15.6 and 1.15.8.

How can I solve this?

Hi @Fi_Vero ,

The reason the whitespaces you define implicitly via the " " character are not rendered is that they are ignored by HTML.
The only way to define whitespace between tokens is via the "ws" attribute.
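If you do want to keep your custom tokens, one option is to strip the leading space from each token's text and encode it via "ws" on the previous token instead. A rough sketch (untested, and `deberta_tokens` is just a placeholder for your current token dicts):

```python
def to_prodigy_tokens(deberta_tokens):
    # Convert tokens whose text carries a leading space (RoBERTa/DeBERTa style)
    # into tokens where the space is recorded as "ws" on the *previous* token.
    prodigy_tokens = []
    for tok in deberta_tokens:
        text = tok["text"]
        has_leading_space = text.startswith(" ")
        if has_leading_space and prodigy_tokens:
            prodigy_tokens[-1]["ws"] = True
        prodigy_tokens.append({
            "text": text[1:] if has_leading_space else text,
            "start": tok["start"] + (1 if has_leading_space else 0),
            "end": tok["end"],
            "id": tok["id"],
            "ws": False,  # flipped to True if the next token starts with a space
        })
    return prodigy_tokens
```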
That said, to train with the DeBERTa tokenizer I would recommend a different approach rather than defining custom tokens: annotate with Prodigy using the default spaCy linguistic tokenization, export the data with data-to-spacy using a transformer-based config, and use this data to train a spaCy transformer pipeline (see the sketch below).
spaCy will take care of aligning the linguistic tokens (and span offsets) to the tokens produced by DeBERTa. One important advantage of this approach is that it will be easier to swap out the transformer if you want to try out different ones (which may also use different tokenizers).
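Roughly, that workflow could look like this (dataset and path names are placeholders; please check prodigy data-to-spacy --help and the spaCy docs for the exact flags in your versions):

```
# export annotations from your Prodigy dataset to spaCy's binary format
prodigy data-to-spacy ./corpus --ner your_ner_dataset --lang en

# create a transformer-based training config (--gpu selects a transformer pipeline;
# you can then point the transformer component at a DeBERTa checkpoint in config.cfg)
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu

# train on the exported data
python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```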
You can find some more info on annotating for transformers here: Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP