Thanks for the reports! This is strange... my initial thought was that the problem here was related to a change in the UI, but now I'm starting to think that it could be related to
add_tokens and that it somehow produces incomplete data for existing spans (which would explain why it occurs in
window.prodigy.content in your browser's development console to view the current task JSON.
Edit: Okay, never mind, I think I can reproduce it!
Edit 2: Alright, I think I got it. This was very interesting, because it turned out to have nothing to do with the actual interface at all. There must have been some changes in the
add_tokens logic that cause the tokens to (sometimes?) receive incorrect
"id" values, e.g. the first token will be
"id": 2. Still investigating that. The good news is that there's a temporary workaround, because you can just fix the IDs.
Edit 3: Wow, this was subtle! The true culprit: sentence segmentation. It all makes sense now (because I was really scratching my head about the UI and there was just nothing that could have explained the problem). Some background on what happens: we refactored the tokenization logic to support character-based spans and tokens. As part of that, we made the token's
"id" call into spaCy's
Token.i (token index in the
Doc, makes a lot of sense). However, when sentences are segmented, that index is still the token's position in the original doc, so the first token of a sentence could easily be
10 or whatever. This caused the tokens to be out-of-sync.
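To make the mismatch concrete, here's a tiny plain-Python model of it. This is only an illustration, not the actual spaCy or Prodigy code: the token list, the sentence boundaries and the `task_tokens` helper are all made up for the example.

```python
# Plain-Python model of the ID mismatch (hypothetical data and helper,
# not spaCy's or Prodigy's implementation).
doc_tokens = ["First", "sentence", ".", "Second", "one", "."]
sentences = [(0, 3), (3, 6)]  # (start, end) token boundaries per sentence


def task_tokens(start, end, use_doc_index):
    """Build the token dicts for one sentence-level task."""
    tokens = []
    for doc_index in range(start, end):
        # Doc-level index (what went wrong) vs. index relative to the sentence
        token_id = doc_index if use_doc_index else doc_index - start
        tokens.append({"text": doc_tokens[doc_index], "id": token_id})
    return tokens


second = sentences[1]
out_of_sync = task_tokens(*second, use_doc_index=True)
in_sync = task_tokens(*second, use_doc_index=False)

print([t["id"] for t in out_of_sync])  # [3, 4, 5]
print([t["id"] for t in in_sync])      # [0, 1, 2]
```

The second sentence becomes its own task, so its tokens should start at 0, but with the doc-level index they start at 3.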
Anyway, this is pretty good news, because the easiest workaround is to just set
--unsegmented (or pre-segment your text yourself if you want sentence segmentation).
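If you'd rather patch the data than change the segmentation, the "fix the IDs" workaround mentioned earlier could look roughly like this. It's a minimal sketch, assuming each task is a dict with `"tokens"` (each having `"start"`, `"end"` and `"id"`) and optional `"spans"` whose `"token_start"` / `"token_end"` are inclusive token IDs. The `fix_token_ids` helper is hypothetical and not part of Prodigy.

```python
import copy


def fix_token_ids(task):
    """Return a copy of the task with token IDs renumbered from 0.

    Hypothetical helper: assumes the task layout described above.
    """
    task = copy.deepcopy(task)
    tokens = task.get("tokens", [])
    for new_id, token in enumerate(tokens):
        token["id"] = new_id
    # Re-derive the span token indices from the character offsets,
    # which are not affected by the wrong IDs.
    start_to_id = {t["start"]: t["id"] for t in tokens}
    end_to_id = {t["end"]: t["id"] for t in tokens}
    for span in task.get("spans", []):
        if span["start"] in start_to_id and span["end"] in end_to_id:
            span["token_start"] = start_to_id[span["start"]]
            span["token_end"] = end_to_id[span["end"]]
    return task


example = {
    "text": "Hello big world",
    "tokens": [
        {"text": "Hello", "start": 0, "end": 5, "id": 2},
        {"text": "big", "start": 6, "end": 9, "id": 3},
        {"text": "world", "start": 10, "end": 15, "id": 4},
    ],
    "spans": [
        {"start": 6, "end": 15, "label": "X", "token_start": 3, "token_end": 4}
    ],
}
fixed = fix_token_ids(example)
print([t["id"] for t in fixed["tokens"]])  # [0, 1, 2]
```

Spans whose character offsets don't line up with token boundaries are left untouched here, so it's worth checking those by hand.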
We'll definitely get an updated release ready this week – should probably be able to do it all today, but don't want to overpromise.