display of tokens without spaces

mikeross · March 7, 2019, 7:31pm

We're using a tokenizer which breaks up hyphenated words and all punctuation. The Prodigy UI inserts whitespace between all these tokens, which makes the sentences harder to read. Example: "anti-PD-1/PD-L1 antibodies" is displayed as "anti - PD - 1 / PD - L1 antibodies".

We need to use our custom tokenizer to enable fine-grained annotation. We have confirmed that the extra whitespace is being introduced by Prodigy, since our tokenizer has correct character offsets for "start" and "end". Is there a way to disable the extra whitespace?

ines · March 7, 2019, 8:03pm

Yes, by default, Prodigy will add a visual space between the tokens. For the tokens that are split on a space, this is necessary – and for all other ones, it’s often nice as a visual indicator.

I’ve been thinking about how to add an option to customise this and I think a good solution might be to allow a boolean "whitespace" property on token objects that can be set to false explicitly? This would also be very consistent with spaCy’s API and how the whitespace is preserved in Token.whitespace_.

mikeross · March 7, 2019, 9:39pm

That would be great! Depending on the semantics, I might call it “use_minimum_whitespace” since, presumably, tokens which are followed by whitespace characters would always display their following whitespace even if the new property was set to “false” (unless you want to override the character offsets).

Another option would be to use a float value which is a percent of the normal space width. So 0.5 would allow someone to have a visual indicator, without as much disruption. The default would be 1.0 on all tokens which would mirror existing behavior.

ines · March 8, 2019, 10:23am

Yeah, the "whitespace" property would definitely have to default to True (just like within spaCy). Most tokens are followed by whitespace, since they’re split on space – but some aren’t. So the decision can’t be “whitespace or no whitespace” and it has to be more like “original whitespace or always whitespace/separated”.

As a test, I might implement this as an option to the add_tokens preprocessor. If enabled, each of the added "tokens" will then include the boolean whitespace info, and the app can render them accordingly.
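
To make the “original whitespace or always separated” distinction concrete, here is a minimal sketch in plain Python. It is not Prodigy's rendering code: the regex tokenizer and the `render` helper are hypothetical stand-ins, used only to show how the two display modes would differ for the example from the first post.

```python
import re

def tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace and
    separates every punctuation character into its own token.
    Returns (start, end) character offsets into the text."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def render(text, spans, honor_whitespace):
    """Join the token texts the way a UI might display them.

    honor_whitespace=False: always put a space between tokens.
    honor_whitespace=True: only put a space where the original
    text has whitespace directly after the token.
    """
    pieces = []
    for i, (start, end) in enumerate(spans):
        pieces.append(text[start:end])
        if i == len(spans) - 1:
            break  # no separator after the last token
        followed_by_ws = text[end:end + 1].isspace()
        if followed_by_ws or not honor_whitespace:
            pieces.append(" ")
    return "".join(pieces)

text = "anti-PD-1/PD-L1 antibodies"
spans = tokenize(text)

print(render(text, spans, honor_whitespace=False))
# anti - PD - 1 / PD - L1 antibodies
print(render(text, spans, honor_whitespace=True))
# anti-PD-1/PD-L1 antibodies
```

With the flag off, the output matches the always-separated display described at the top of the thread; with it on, the original string is reproduced exactly.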

mcllstr · March 23, 2020, 4:19pm

Was the whitespace=True/False property in add_tokens ever implemented? I tried adding whitespace=False as an argument to add_tokens, but it appears to be unrecognized.

ines · March 23, 2020, 9:04pm

No, it was mostly at the idea stage and it never came up again, so I kinda forgot and it never made it onto my "enhancement proposals to implement" list. (We typically comment on issues that were implemented and change the label to "done" so the status is clear.)

Just reread the thread and it's actually a pretty straightforward implementation and totally something we should have. So it's on my list now.

ines · June 17, 2020, 4:51pm

Just released Prodigy v1.10, which now respects a "ws" property on each token, indicating whether it's followed by whitespace or not. The whitespace info is added automatically when the incoming text is tokenized by Prodigy, and you can turn the behaviour off if needed by setting "honor_token_whitespace": false. Here's an example of how this looks for wordpiece tokens
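
As a rough sketch of the token format this describes (not the wordpiece example itself), the snippet below builds a task for a custom tokenizer and sets "ws" on each token from the character offsets. The regex tokenizer and the `make_task` helper are hypothetical; the "text", "start", "end" and "id" keys are assumed to follow Prodigy's usual token format, and only "ws" and "honor_token_whitespace" are taken from the post above.

```python
import json
import re

def make_task(text):
    """Build a task dict with pre-tokenized text.

    Each token records whether the original text has whitespace
    directly after it, stored under the "ws" key.
    """
    tokens = []
    # Hypothetical tokenizer: words and single punctuation characters.
    for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append({
            "text": m.group(),
            "start": m.start(),
            "end": m.end(),
            "id": i,
            "ws": text[m.end():m.end() + 1].isspace(),
        })
    return {"text": text, "tokens": tokens}

task = make_task("anti-PD-1/PD-L1 antibodies")
for token in task["tokens"]:
    print(json.dumps(token))
```

Here only the "L1" token ends up with "ws" set to true, since it is the only one followed by a space in the original text. If you supply your own tokens like this, the offsets need to match the text exactly, otherwise the "ws" values will be wrong.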
