Problem with multiple whitespaces

apohllo · January 27, 2022, 8:54pm

I have a problem with span annotations in rel.manual when there are multiple whitespace characters between tokens.
For example:

[{
  "text": "I like baby cats because they're   cute",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0, "ws": true},
    {"text": "like", "start": 2, "end": 6, "id": 1, "ws": true},
    {"text": "baby", "start": 7, "end": 11, "id": 2, "ws": true},
    {"text": "cats", "start": 12, "end": 16, "id": 3, "ws": true},
    {"text": "because", "start": 17, "end": 24, "id": 4, "ws": true},
    {"text": "they", "start": 25, "end": 29, "id": 5, "ws": false},
    {"text": "'re", "start": 29, "end": 32, "id": 6, "ws": true},
    {"text": "cute", "start": 35, "end": 39, "id": 7, "ws": false}
  ],
  "spans": [
    {"start": 7, "end": 16, "token_start": 2, "token_end": 3, "label": "REF"},
    {"start": 25, "end": 37, "token_start": 5, "token_end": 7, "label": "REASON"},
    {"start": 35, "end": 39, "token_start": 7, "token_end": 7, "label": "ATTR"}
  ]
}]

After marking "cute":

ines · January 27, 2022, 9:15pm

Hi! Thanks for the report, this definitely looks like a problem with the rendering of spans in the UI! If you check the generated JSONL based on this example, are the mappings still correct or do you end up with invalid spans?

apohllo · January 28, 2022, 4:49pm

Output:

{
  "text":"I like baby cats because they're   cute",
  "tokens":[
    {
      "text":"I",
      "start":0,
      "end":1,
      "id":0,
      "ws":true,
      "disabled":false
    },
    {
      "text":"like",
      "start":2,
      "end":6,
      "id":1,
      "ws":true,
      "disabled":false
    },
    {
      "text":"baby",
      "start":7,
      "end":11,
      "id":2,
      "ws":true,
      "disabled":false
    },
    {
      "text":"cats",
      "start":12,
      "end":16,
      "id":3,
      "ws":true,
      "disabled":false
    },
    {
      "text":"because",
      "start":17,
      "end":24,
      "id":4,
      "ws":true,
      "disabled":false
    },
    {
      "text":"they",
      "start":25,
      "end":29,
      "id":5,
      "ws":false,
      "disabled":false
    },
    {
      "text":"'re",
      "start":29,
      "end":32,
      "id":6,
      "ws":true,
      "disabled":false
    },
    {
      "text":"cute",
      "start":33,
      "end":37,
      "id":7,
      "ws":false,
      "disabled":false
    }
  ],
  "spans":[
    {
      "start":33,
      "end":37,
      "token_start":7,
      "token_end":7,
      "label":"Y"
    }
  ]
}

So the tokens have been changed compared to the original JSON, and the last token doesn't match the text: its offsets are now 33-37, but "cute" sits at 35-39 in the text. The span is consistent with the new token, but this should be fixed.

I need to keep the information about all whitespace characters.

apohllo · January 28, 2022, 4:52pm

If I don't annotate anything, the last token's offsets are also changed (incorrectly).

ines · January 28, 2022, 5:33pm

Ah, sorry, I didn't read the original sample correctly: the problem here is that the "tokens" don't match the text, so the interface gets confused. If you edit the text, the tokens need to match, and the tokenized version of your text would include an extra token for the whitespace.

This is also the output from spaCy's tokenizer and Prodigy's add_tokens preprocessor:

{
  "text": "I like baby cats because they're   cute", 
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0, "ws": true}, 
    {"text": "like", "start": 2, "end": 6, "id": 1, "ws": true}, 
    {"text": "baby", "start": 7, "end": 11, "id": 2, "ws": true}, 
    {"text": "cats", "start": 12, "end": 16, "id": 3, "ws": true}, 
    {"text": "because", "start": 17, "end": 24, "id": 4, "ws": true}, 
    {"text": "they", "start": 25, "end": 29, "id": 5, "ws": false}, 
    {"text": "'re", "start": 29, "end": 32, "id": 6, "ws": true}, 
    {"text": "  ", "start": 33, "end": 35, "id": 7, "ws": false}, 
    {"text": "cute", "start": 35, "end": 39, "id": 8, "ws": false}
  ]
}
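To spot this kind of mismatch before loading data, a small check along these lines might help. It is only a sketch, not part of Prodigy: the helper names make_tokens and find_token_gaps are made up for illustration, and the check assumes each token covers its own characters plus exactly one trailing space when "ws" is true.

```python
TEXT = "I like baby cats because they're   cute"


def make_tokens(pieces):
    """Build token dicts in the format shown above from (text, start, ws) tuples.

    Hypothetical helper, used here only to keep the example short.
    """
    return [
        {"text": t, "start": s, "end": s + len(t), "id": i, "ws": ws}
        for i, (t, s, ws) in enumerate(pieces)
    ]


def find_token_gaps(text, tokens):
    """Return (start, end, gap_text) for each stretch of text no token covers."""
    gaps = []
    cursor = 0
    for tok in tokens:
        if tok["start"] > cursor:
            gaps.append((cursor, tok["start"], text[cursor:tok["start"]]))
        # A token accounts for its own characters plus one trailing space if ws is true.
        cursor = tok["end"] + (1 if tok["ws"] else 0)
    if cursor < len(text):
        gaps.append((cursor, len(text), text[cursor:]))
    return gaps


# Tokens from the original post: no token for the extra whitespace.
original_tokens = make_tokens([
    ("I", 0, True), ("like", 2, True), ("baby", 7, True), ("cats", 12, True),
    ("because", 17, True), ("they", 25, False), ("'re", 29, True),
    ("cute", 35, False),
])

# Tokens with the extra whitespace token, as in the output above.
fixed_tokens = make_tokens([
    ("I", 0, True), ("like", 2, True), ("baby", 7, True), ("cats", 12, True),
    ("because", 17, True), ("they", 25, False), ("'re", 29, True),
    ("  ", 33, False), ("cute", 35, False),
])

print(find_token_gaps(TEXT, original_tokens))  # [(33, 35, '  ')]
print(find_token_gaps(TEXT, fixed_tokens))     # []
```

A non-empty result points at the exact character range that needs its own whitespace token.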

apohllo · February 1, 2022, 8:17am

OK, thank you. So runs of multiple whitespace characters have to be tokens of their own. Maybe an assertion would be useful, e.g. text == ''.join(t['text'] + (' ' if t['ws'] else '') for t in tokens)
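Spelled out as a standalone check on the example from this thread, that assertion could look roughly like the sketch below. The function names rebuild_text and check_tokens are hypothetical and not part of Prodigy; only the "text" and "ws" keys of each token are used.

```python
TEXT = "I like baby cats because they're   cute"


def rebuild_text(tokens):
    """Join token texts, adding one space after each token whose ws flag is true."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


def check_tokens(text, tokens):
    """Raise ValueError if the tokens do not reproduce the text exactly."""
    rebuilt = rebuild_text(tokens)
    if rebuilt != text:
        raise ValueError(f"tokens rebuild to {rebuilt!r}, expected {text!r}")


# Tokens from the original post (no whitespace token).
original_tokens = [
    {"text": t, "ws": ws}
    for t, ws in [
        ("I", True), ("like", True), ("baby", True), ("cats", True),
        ("because", True), ("they", False), ("'re", True), ("cute", False),
    ]
]

# Same tokens with the extra "  " token inserted before "cute".
fixed_tokens = original_tokens[:7] + [{"text": "  ", "ws": False}] + original_tokens[7:]

check_tokens(TEXT, fixed_tokens)  # passes silently

try:
    check_tokens(TEXT, original_tokens)
except ValueError as err:
    print(err)  # reports the single-space reconstruction
```

Running a check like this over every example before starting the annotation server would catch the mismatch at load time rather than in the UI.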
