I am trying to use my own tokens. The reason is that I am doing NER (token classification), and I want to use the tokens produced by the model I will be using for NER.
Hence, I create my own list of tokens as dictionaries, where each one contains "start", "end", "text", "id", and "ws", referring to the start index, end index, displayed text, token id, and whitespace flag (true/false).
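For example, for the text "Hello there", the list might look like this (a sketch; the ids and offsets are made up for illustration):

```python
tokens = [
    {"start": 0, "end": 5,  "text": "Hello",  "id": 0, "ws": False},
    {"start": 5, "end": 11, "text": " there", "id": 1, "ws": False},
]
```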
Now, I am using a DeBERTa/RoBERTa-style tokenizer, which includes the whitespace in front of tokens. So, for instance, instead of ["Hello", "there"] it produces ["Hello", " there"], with a leading space in front of "there".
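This leading space comes from the tokenizer itself. A quick demonstration, assuming the Hugging Face transformers tokenizer for roberta-base ("Ġ" is the byte-level BPE marker for a leading space):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The subword pieces keep the leading space as the "Ġ" marker
print(tokenizer.tokenize("Hello there"))  # ['Hello', 'Ġthere']

# Converted back to text, the piece still carries the space
print(tokenizer.convert_tokens_to_string(["Ġthere"]))  # ' there'
```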
Hence, I set "ws" to false everywhere, since the tokens already include their spaces. However, when displayed they are all run together, i.e. "Hellothere". And when I set "ws" to true, I get both the space added by spaCy and the one from the token text, i.e. "Hello  there", with two spaces instead of one.
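The only reconciliation I can think of is to strip the leading space out of each token's "text" and record it on the "ws" flag instead. A minimal sketch, assuming "ws" means "this token is followed by whitespace" (like spaCy's token.whitespace_); normalise_tokens is just a made-up helper name:

```python
def normalise_tokens(tokens):
    """Move the leading space out of "text" and into the "ws" flag (sketch)."""
    out = []
    for i, tok in enumerate(tokens):
        stripped = tok["text"].lstrip(" ")
        offset = len(tok["text"]) - len(stripped)
        nxt = tokens[i + 1]["text"] if i + 1 < len(tokens) else ""
        out.append({
            "text": stripped,
            # shift "start" past the stripped space so offsets still line up
            "start": tok["start"] + offset,
            "end": tok["end"],
            "id": tok["id"],
            # followed-by-whitespace: the next token's text starts with a space
            "ws": nxt.startswith(" "),
        })
    return out
```

But that feels like working around the tokenizer rather than fixing the display.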
FYI, when I do

```python
from spacy.tokens import Doc

# `tokens` here is the list of token texts, e.g. ["Hello", " there"]
spaces = [False] * len(tokens)
doc = Doc(nlp.vocab, words=tokens, spaces=spaces)
```
The resulting doc displays as it should, so I suppose the issue is somewhere in the front-end.
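Concretely, checking the text of that Doc (with the example tokens from above) gives the expected single space:

```python
print(repr(doc.text))  # 'Hello there'
```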
The display error (or behaviour) occurs with Prodigy versions 1.15.6 and 1.15.8, the ones I tried.
How can I solve this?