Problem with multiple whitespaces

I have problems with span annotation in rel.manual when there are multiple whitespace characters between tokens.
E.g.:

[{
  "text": "I like baby cats because they're   cute",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0, "ws": true},
    {"text": "like", "start": 2, "end": 6, "id": 1, "ws": true},
    {"text": "baby", "start": 7, "end": 11, "id": 2, "ws": true},
    {"text": "cats", "start": 12, "end": 16, "id": 3, "ws": true},
    {"text": "because", "start": 17, "end": 24, "id": 4, "ws": true},
    {"text": "they", "start": 25, "end": 29, "id": 5, "ws": false},
    {"text": "'re", "start": 29, "end": 32, "id": 6, "ws": true},
    {"text": "cute", "start": 35, "end": 39, "id": 7, "ws": false}
  ],
  "spans": [
    {"start": 7, "end": 16, "token_start": 2, "token_end": 3, "label": "REF"},
    {"start": 25, "end": 37, "token_start": 5, "token_end": 7, "label": "REASON"},
    {"start": 35, "end": 39, "token_start": 7, "token_end": 7, "label": "ATTR"}
  ]
}]

image
After marking "cute":
image

Hi! Thanks for the report. This definitely looks like a problem with how spans are rendered in the UI. If you check the generated JSONL based on this example, are the mappings still correct, or do you end up with invalid spans?

Output:

{
  "text":"I like baby cats because they're   cute",
  "tokens":[
    {
      "text":"I",
      "start":0,
      "end":1,
      "id":0,
      "ws":true,
      "disabled":false
    },
    {
      "text":"like",
      "start":2,
      "end":6,
      "id":1,
      "ws":true,
      "disabled":false
    },
    {
      "text":"baby",
      "start":7,
      "end":11,
      "id":2,
      "ws":true,
      "disabled":false
    },
    {
      "text":"cats",
      "start":12,
      "end":16,
      "id":3,
      "ws":true,
      "disabled":false
    },
    {
      "text":"because",
      "start":17,
      "end":24,
      "id":4,
      "ws":true,
      "disabled":false
    },
    {
      "text":"they",
      "start":25,
      "end":29,
      "id":5,
      "ws":false,
      "disabled":false
    },
    {
      "text":"'re",
      "start":29,
      "end":32,
      "id":6,
      "ws":true,
      "disabled":false
    },
    {
      "text":"cute",
      "start":33,
      "end":37,
      "id":7,
      "ws":false,
      "disabled":false
    }
  ],
  "spans":[
    {
      "start":33,
      "end":37,
      "token_start":7,
      "token_end":7,
      "label":"Y"
    }
  ]
}

So the tokens have been changed compared to the original JSON, and the last token no longer matches the text: "cute" now has offsets 33-37 instead of 35-39. The span is consistent with the new token, but the token offsets themselves are wrong and should be fixed.

I need to keep the information about all whitespace.

If I don't annotate anything, the offsets of the last token are still changed (wrongly).
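A quick way to confirm the mismatch described above is to slice the text with each token's offsets and compare the result with the token's "text" value. This is only a minimal sketch in plain Python; the helper name find_offset_mismatches is made up for illustration and is not part of Prodigy.

```python
def find_offset_mismatches(text, tokens):
    """Return (id, token text, text slice) for tokens whose offsets don't match."""
    return [
        (tok["id"], tok["text"], text[tok["start"]:tok["end"]])
        for tok in tokens
        if text[tok["start"]:tok["end"]] != tok["text"]
    ]


text = "I like baby cats because they're   cute"

# Last token as it appears in the exported output above.
exported_last = {"text": "cute", "start": 33, "end": 37, "id": 7, "ws": False}
# Last token as it was in the original input.
original_last = {"text": "cute", "start": 35, "end": 39, "id": 7, "ws": False}

print(find_offset_mismatches(text, [exported_last]))  # [(7, 'cute', '  cu')]
print(find_offset_mismatches(text, [original_last]))  # []
```

Note that this check only looks at offsets. The original input passes it even though its tokens do not account for the extra whitespace.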

Ah, sorry, I didn't read the original sample correctly: the problem here is that the "tokens" don't match the text, so the interface gets confused. If you edit the text, the tokens need to match, and the tokenized version of your text would include an extra token for the whitespace.

This is also the output from spaCy's tokenizer and Prodigy's add_tokens preprocessor:

{
  "text": "I like baby cats because they're   cute", 
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0, "ws": true}, 
    {"text": "like", "start": 2, "end": 6, "id": 1, "ws": true}, 
    {"text": "baby", "start": 7, "end": 11, "id": 2, "ws": true}, 
    {"text": "cats", "start": 12, "end": 16, "id": 3, "ws": true}, 
    {"text": "because", "start": 17, "end": 24, "id": 4, "ws": true}, 
    {"text": "they", "start": 25, "end": 29, "id": 5, "ws": false}, 
    {"text": "'re", "start": 29, "end": 32, "id": 6, "ws": true}, 
    {"text": "  ", "start": 33, "end": 35, "id": 7, "ws": false}, 
    {"text": "cute", "start": 35, "end": 39, "id": 8, "ws": false}
  ]
}
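If the tokens come from somewhere other than spaCy or add_tokens, the gaps could be filled in by hand. The following is a rough sketch that only reproduces the shape of the output above: one space after a token becomes "ws": true, and any remaining characters before the next token become a token of their own. The function name add_whitespace_tokens is hypothetical, whitespace before the first token is not handled, and this is not a reimplementation of spaCy's tokenizer.

```python
def add_whitespace_tokens(text, tokens):
    """Rebuild a token list so the gaps between tokens are covered."""
    fixed = []
    for i, tok in enumerate(tokens):
        # The gap runs up to the next token, or to the end of the text.
        next_start = tokens[i + 1]["start"] if i + 1 < len(tokens) else len(text)
        gap = text[tok["end"]:next_start]
        has_space = gap.startswith(" ")
        fixed.append({"text": tok["text"], "start": tok["start"],
                      "end": tok["end"], "ws": has_space})
        # Anything beyond the first space becomes a separate token.
        extra_start = tok["end"] + (1 if has_space else 0)
        if extra_start < next_start:
            fixed.append({"text": text[extra_start:next_start],
                          "start": extra_start, "end": next_start, "ws": False})
    # Renumber, since inserted tokens shift the ids that follow them.
    for new_id, tok in enumerate(fixed):
        tok["id"] = new_id
    return fixed


text = "I like baby cats because they're   cute"
offsets = [("I", 0, 1), ("like", 2, 6), ("baby", 7, 11), ("cats", 12, 16),
           ("because", 17, 24), ("they", 25, 29), ("'re", 29, 32), ("cute", 35, 39)]
tokens = [{"text": t, "start": s, "end": e} for t, s, e in offsets]

fixed = add_whitespace_tokens(text, tokens)
print(fixed[7])  # the inserted whitespace token
print(fixed[8])  # "cute", now with id 8
```

Because the ids after the inserted token shift by one, any "token_start" and "token_end" values in existing spans would need to be remapped as well.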

Ok, thank you. So runs of multiple whitespace characters have to be tokens of their own. Maybe an assert would be useful, e.g. text == ''.join(t["text"] + (' ' if t["ws"] else '') for t in tokens)
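The check suggested above could be wrapped in a small helper and run over the data before loading it. This is just a sketch, assuming the tokens are plain dicts with "text" and "ws" keys as in the JSON in this thread; the names reconstruct and check_tokens are made up for illustration.

```python
def reconstruct(tokens):
    """Join the token texts, adding one space after each token with ws=True."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in tokens)


def check_tokens(text, tokens):
    """Raise ValueError if the tokens don't rebuild the text exactly."""
    rebuilt = reconstruct(tokens)
    if rebuilt != text:
        raise ValueError(f"tokens rebuild to {rebuilt!r}, expected {text!r}")


text = "a   b"

# Without a whitespace token, two of the three spaces are lost.
bad = [{"text": "a", "ws": True}, {"text": "b", "ws": False}]
# With the extra whitespace as its own token, the text round-trips.
good = [{"text": "a", "ws": True}, {"text": "  ", "ws": False},
        {"text": "b", "ws": False}]

check_tokens(text, good)  # passes silently
try:
    check_tokens(text, bad)
except ValueError as err:
    print(err)
```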