We're using a tokenizer that splits hyphenated words and all punctuation into separate tokens. The Prodigy UI inserts whitespace between all of these tokens, which makes the sentences harder to read. Example: "anti-PD-1/PD-L1 antibodies" is displayed as "anti - PD - 1 / PD - L1 antibodies".
We need to use our custom tokenizer to enable fine-grained annotation. We have confirmed that the extra whitespace is being introduced by Prodigy, since our tokenizer produces correct character offsets for "start" and "end". Is there a way to disable the extra whitespace?
Yes, by default, Prodigy will add a visual space between the tokens. For tokens that are split on a space, this is necessary, and for all other ones, it’s often nice as a visual indicator.
I’ve been thinking about how to add an option to customise this, and I think a good solution might be to allow a boolean "whitespace" property on token objects that can be set to false explicitly. This would also be very consistent with spaCy’s API and how the whitespace is preserved in Token.whitespace_.
That would be great! Depending on the semantics, I might call it “use_minimum_whitespace” since, presumably, tokens which are followed by whitespace characters would always display their following whitespace even if the new property was set to “false” (unless you want to override the character offsets).
Another option would be to use a float value that is a fraction of the normal space width. So 0.5 would allow someone to have a visual indicator without as much disruption. The default would be 1.0 on all tokens, which would mirror the existing behavior.
Yeah, the "whitespace" property would definitely have to default to True (just like within spaCy). Most tokens are followed by whitespace, since they’re split on spaces, but some aren’t. So the decision can’t be “whitespace or no whitespace”; it has to be more like “original whitespace or always whitespace/separated”.
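To make the "original whitespace or always separated" distinction concrete, here is a minimal sketch of how a renderer could treat such a flag. The `render` function and its `preserve_original` parameter are hypothetical names used only for illustration, and the "whitespace" key is the property proposed in this thread, not an existing Prodigy API at this point.

```python
def render(tokens, preserve_original=True):
    """Join token texts for display (hypothetical helper, not Prodigy code)."""
    parts = []
    for token in tokens:
        parts.append(token["text"])
        # A missing "whitespace" flag defaults to True, as proposed above.
        # With preserve_original=False, every token is followed by a space.
        if not preserve_original or token.get("whitespace", True):
            parts.append(" ")
    return "".join(parts).rstrip()


# Tokens for the example from the original question.
tokens = [
    {"text": "anti", "whitespace": False},
    {"text": "-", "whitespace": False},
    {"text": "PD", "whitespace": False},
    {"text": "-", "whitespace": False},
    {"text": "1", "whitespace": False},
    {"text": "/", "whitespace": False},
    {"text": "PD", "whitespace": False},
    {"text": "-", "whitespace": False},
    {"text": "L1", "whitespace": True},
    {"text": "antibodies", "whitespace": False},
]

original = render(tokens)                            # original whitespace
separated = render(tokens, preserve_original=False)  # always separated
```

With the flag honored, the text reads as it was written; with it ignored, the output matches the spaced-out display described in the first post.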
As a test, I might implement this as an option to the add_tokens preprocessor. If enabled, each of the added "tokens" will then include the boolean whitespace info, and the app can render them accordingly.
Was the whitespace=True/False property in add_tokens ever implemented? I tried adding whitespace=False as an argument to add_tokens, but it appears to be unrecognized.
No, it was mostly at the idea stage and it never came up again, so I kinda forgot and it never made it onto my "enhancement proposals to implement" list. (We typically comment on issues that were implemented and change the label to "done" so the status is clear.)
Just reread the thread and it's actually a pretty straightforward implementation and totally something we should have. So it's on my list now.
Just released Prodigy v1.10, which now respects a "ws" property on each token, indicating whether it's followed by whitespace or not. The whitespace info is added automatically when the incoming text is tokenized by Prodigy, and you can turn the behaviour off if needed by setting "honor_token_whitespace": false. Here's an example of how this looks for wordpiece tokens
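For input that is pre-tokenized by a custom tokenizer, as in the original question, the "ws" values would presumably need to be supplied along with the tokens. The following is a minimal sketch, assuming the tokenizer yields (start, end) character offsets; the `tokens_with_ws` helper is a hypothetical name and not part of Prodigy, and the other token keys ("text", "start", "end", "id") follow Prodigy's usual token format.

```python
def tokens_with_ws(text, spans):
    """Build token dicts with a "ws" flag from (start, end) character offsets.

    Hypothetical helper: "ws" is True only if the character right after
    the token is whitespace in the original text.
    """
    tokens = []
    for i, (start, end) in enumerate(spans):
        followed_by_space = end < len(text) and text[end].isspace()
        tokens.append(
            {
                "text": text[start:end],
                "start": start,
                "end": end,
                "id": i,
                "ws": followed_by_space,
            }
        )
    return tokens


text = "anti-PD-1/PD-L1 antibodies"
# Offsets as a custom tokenizer might produce them (assumed for illustration).
spans = [(0, 4), (4, 5), (5, 7), (7, 8), (8, 9), (9, 10),
         (10, 12), (12, 13), (13, 15), (16, 26)]

task = {"text": text, "tokens": tokens_with_ws(text, spans)}
```

Here only the "L1" token ends up with "ws" set to true, so the hyphens and the slash should render without surrounding spaces as long as "honor_token_whitespace" is left enabled.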