Need explanation of `ws` key in tokens field of annotations exported in jsonl file

Hi guys!

I have already annotated a dataset and exported the annotations using Prodigy's db-out command.

Now I want to increase the size of the dataset by programmatically augmenting the spans, and the tokens along with them, since I know the set of values each of my named entities can take.

For this, I need to understand the tokens field in the jsonl file exported using db-out: every token in it has a key called ws.

Can you explain what this key stands for?

Thanks & Regards,
Vinayak.

Hi! "ws" stands for "whitespace" and indicates whether the token is followed by a space or not, just like Token.whitespace_ in spaCy. This lets you reconstruct the original text from the tokens, and the UI can use it to display tokens in a more readable way. (If you leave the key out of the data you load in, it defaults to true, i.e. followed by whitespace.)

You can see an example of this here: https://prodi.gy/docs/named-entity-recognition#transformers-tokenizers
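
For the augmentation use case, here is a minimal sketch of how "ws" ties the tokens back to the text. It assumes the usual token keys (text, start, end, id, ws) and that tokens are separated by at most one space. The helper names text_from_tokens and build_tokens are made up for illustration and are not part of Prodigy.

```python
def text_from_tokens(tokens):
    """Rebuild the text from a list of token dicts using their "ws" flags."""
    # A missing "ws" key is treated as True (followed by whitespace).
    return "".join(
        tok["text"] + (" " if tok.get("ws", True) else "") for tok in tokens
    )


def build_tokens(pieces):
    """Create token dicts with character offsets from (text, ws) pairs.

    Hypothetical helper: useful when swapping in new entity values, since
    the start/end offsets of every following token shift as well.
    """
    tokens = []
    offset = 0
    for i, (text, ws) in enumerate(pieces):
        tokens.append(
            {"text": text, "start": offset, "end": offset + len(text), "id": i, "ws": ws}
        )
        # Advance past the token and the single space that follows it, if any.
        offset += len(text) + (1 if ws else 0)
    return tokens


# Made-up example sentence, not taken from a real export.
tokens = build_tokens([("Hello", False), (",", True), ("world", False), ("!", False)])
text = text_from_tokens(tokens)
```

After changing tokens, remember to recompute the character offsets and token indices of your spans too, so that they still line up with the new tokens.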

Thanks @ines. This is helpful! :slight_smile: