display of tokens without spaces


(Mike Ross) #1

We’re using a tokenizer that splits hyphenated words and all punctuation into separate tokens. The Prodigy UI inserts whitespace between all of these tokens, which makes the sentences harder to read. For example, “anti-PD-1/PD-L1 antibodies” is displayed as “anti - PD - 1 / PD - L1 antibodies”.

We need to use our custom tokenizer to enable fine-grained annotation. We have confirmed that the extra whitespace is being introduced by Prodigy, since our tokenizer produces correct character offsets for “start” and “end”. Is there a way to disable the extra whitespace?

(Ines Montani) #2

Yes, by default, Prodigy will add a visual space between the tokens. For tokens that are split on a space, this is necessary – and for all other ones, it’s often nice as a visual indicator.

I’ve been thinking about how to add an option to customise this, and I think a good solution might be to allow a boolean "whitespace" property on token objects that can be set to false explicitly. This would also be very consistent with spaCy’s API and how the whitespace is preserved in Token.whitespace_.
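
To make the idea concrete, here is a minimal sketch of how such a flag could be derived from the character offsets a custom tokenizer already produces. The "whitespace" key is only the name proposed above, not a confirmed Prodigy property, and add_whitespace_flags is a hypothetical helper written for illustration.

```python
def add_whitespace_flags(text, spans):
    """Build token dicts from (start, end) character offsets.

    The "whitespace" key is the property name proposed in this thread,
    not a confirmed Prodigy field. It is True only if the character
    directly after the token in the original text is whitespace.
    """
    tokens = []
    for i, (start, end) in enumerate(spans):
        tokens.append({
            "id": i,
            "text": text[start:end],
            "start": start,
            "end": end,
            # Empty slice at the end of the text gives "", and "".isspace() is False
            "whitespace": text[end:end + 1].isspace(),
        })
    return tokens


text = "anti-PD-1/PD-L1 antibodies"
# Offsets as a fine-grained tokenizer might produce them (illustrative)
spans = [(0, 4), (4, 5), (5, 7), (7, 8), (8, 9),
         (9, 10), (10, 12), (12, 13), (13, 15), (16, 26)]
tokens = add_whitespace_flags(text, spans)
```

In this example only the "L1" token is followed by a space in the original text, so it would be the only token flagged as having trailing whitespace.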

(Mike Ross) #3

That would be great! Depending on the semantics, I might call it “use_minimum_whitespace” since, presumably, tokens which are followed by whitespace characters would always display their following whitespace even if the new property was set to “false” (unless you want to override the character offsets).

Another option would be to use a float value that is a fraction of the normal space width. So 0.5 would allow someone to have a visual indicator without as much disruption. The default would be 1.0 on all tokens, which would mirror the existing behavior.

(Ines Montani) #4

Yeah, the "whitespace" property would definitely have to default to True (just like within spaCy). Most tokens are followed by whitespace, since they’re split on space – but some aren’t. So the decision can’t be “whitespace or no whitespace” and it has to be more like “original whitespace or always whitespace/separated”.
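
A rough sketch of those two display modes, in plain Python rather than the actual front-end code: render is a hypothetical function, and the "whitespace" key is the proposed property name, assumed here to default to True when missing.

```python
def render(tokens, original_whitespace=True):
    """Join token texts for display (illustrative only).

    original_whitespace=False: always separate tokens with a space
    (the current default display described in this thread).
    original_whitespace=True: add a space only after tokens whose
    "whitespace" flag is True; a missing flag is treated as True.
    """
    if not original_whitespace:
        return " ".join(t["text"] for t in tokens)
    parts = []
    for t in tokens:
        parts.append(t["text"])
        if t.get("whitespace", True):
            parts.append(" ")
    # Drop a trailing space if the last token had the flag set or missing
    return "".join(parts).rstrip()


# Tokens for "anti-PD-1/PD-L1 antibodies"; only "L1" is followed by a space
pieces = ["anti", "-", "PD", "-", "1", "/", "PD", "-", "L1", "antibodies"]
example_tokens = [{"text": p, "whitespace": p == "L1"} for p in pieces]

separated = render(example_tokens, original_whitespace=False)
original = render(example_tokens, original_whitespace=True)
```

Here separated reproduces the spaced-out display from the first post, while original matches the source text.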

As a test, I might implement this as an option to the add_tokens preprocessor. If enabled, each of the added "tokens" will then include the boolean whitespace info, and the app can render them accordingly.