Need explanation of `ws` key in tokens field of annotations exported in jsonl file

ElisonSherton · September 21, 2020, 5:33am

Hi guys!

I have already annotated a dataset and exported the annotations using Prodigy's db-out command.

Now I want to increase the size of the dataset by programmatically augmenting the spans, and correspondingly the tokens, since I know the different values my named entities can take.

To do this, I need to understand the tokens field in the JSONL file exported by db-out. Every token there has a key called ws.

Can you explain what this key stands for?

Thanks & Regards,
Vinayak.

ines · September 21, 2020, 7:38am

Hi! "ws" stands for "whitespace" and indicates whether the token is followed by a space or not, just like Token.whitespace_ in spaCy. This allows you to reconstruct the original text from the tokens, and it can be used in the UI to display tokens in a more readable way. (If you leave out the key in the data you load in, it defaults to true, i.e. followed by whitespace.)

You can see an example of this here: https://prodi.gy/docs/named-entity-recognition#transformers-tokenizers
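
As a rough illustration of how "ws" lets you rebuild the text, here is a minimal Python sketch. The token dicts are made-up sample data in the shape db-out exports, and `rebuild_text` is a hypothetical helper, not part of Prodigy. It assumes every true "ws" corresponds to exactly one space character.

```python
# Made-up sample tokens in the shape found in the exported "tokens" field.
tokens = [
    {"text": "Hello", "start": 0, "end": 5, "id": 0, "ws": False},
    {"text": ",", "start": 5, "end": 6, "id": 1, "ws": True},
    {"text": "world", "start": 7, "end": 12, "id": 2, "ws": False},
    {"text": "!", "start": 12, "end": 13, "id": 3, "ws": False},
]


def rebuild_text(tokens):
    """Join token texts, adding one space after each token whose "ws" is true.

    A missing "ws" key is treated as True, matching the default described above.
    """
    return "".join(
        tok["text"] + (" " if tok.get("ws", True) else "") for tok in tokens
    )


text = rebuild_text(tokens)
print(text)
```

If you generate new tokens during augmentation, setting "ws" this way and recomputing "start" and "end" from the rebuilt text should keep the tokens and the span offsets consistent with each other.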

ElisonSherton · September 21, 2020, 7:41am

Thanks @ines. This is helpful!
