ws vs disabled

I'm using Prodigy for NER with some slightly unusual strings. There's (usually) no whitespace; instead, punctuation is used to separate tokens. I've created a custom recipe with a Hugging Face tokenizer that uses BertPreTokenizer + Digits, then BPE (based on https://github.com/explosion/prodigy-recipes/blob/master/other/transformers_tokenizers.py).

I've modified it slightly: I don't want the user to be able to select separators like "-" or "_" as the first or last token of an NER span. But I also don't want the UI to treat them as whitespace, or the text doesn't display correctly, i.e. "fcu-3a-2-rm-temp" is displayed as "fcu - 3 a - 2 - rm - temp".

So in my recipe I set disabled=True for all non-text characters and ws=True only for actual spaces.

This works pretty well except for double-clicking – that seems to use ws as the boundary. Is there a way to make the UI's double-click select the sequence between the preceding and following disabled tokens, i.e. based on "disabled" rather than "ws"?

    {'text': 'fcu', 'id': 7, 'start': 7, 'end': 10, 'tokenizer_id': 126, 'disabled': False, 'ws': False}
    {'text': '-', 'id': 8, 'start': 10, 'end': 11, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
    {'text': '3', 'id': 9, 'start': 11, 'end': 12, 'tokenizer_id': 46, 'disabled': False, 'ws': False}
    {'text': 'a', 'id': 10, 'start': 12, 'end': 13, 'tokenizer_id': 66, 'disabled': False, 'ws': False}
    {'text': '-', 'id': 11, 'start': 13, 'end': 14, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
    {'text': '2', 'id': 12, 'start': 14, 'end': 15, 'tokenizer_id': 45, 'disabled': False, 'ws': False}
    {'text': '-', 'id': 13, 'start': 15, 'end': 16, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
    {'text': 'rm', 'id': 14, 'start': 16, 'end': 18, 'tokenizer_id': 162, 'disabled': False, 'ws': False}
    {'text': '-', 'id': 15, 'start': 18, 'end': 19, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
    {'text': 'temp', 'id': 16, 'start': 19, 'end': 23, 'tokenizer_id': 139, 'disabled': False, 'ws': True}
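For context, token dicts like the above could be produced roughly like this. This is a minimal sketch: `make_tokens`, `DELIM` and the `(text, start, end)` input format are my assumptions for illustration, not Prodigy's or the original recipe's actual API.

```python
# Hypothetical set of delimiter characters that should not be selectable
# as the start/end of a span.
DELIM = set("-_./")

def make_tokens(pieces):
    """pieces: list of (text, start, end) spans from a pre-tokenizer."""
    tokens = []
    for i, (text, start, end) in enumerate(pieces):
        is_last = i == len(pieces) - 1
        # ws=True only when an actual space follows in the source text
        followed_by_space = not is_last and pieces[i + 1][1] > end
        tokens.append({
            "text": text,
            "id": i,
            "start": start,
            "end": end,
            # delimiters can't start/end a span and aren't rendered as whitespace
            "disabled": text in DELIM,
            "ws": followed_by_space or is_last,
        })
    return tokens

pieces = [("fcu", 0, 3), ("-", 3, 4), ("3", 4, 5), ("a", 5, 6)]
tokens = make_tokens(pieces)
```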

Hi! I hope I understand your question correctly. Double-clicking always snaps to the token boundaries of whatever you double-clicked on (whether or not the token is followed by whitespace shouldn't have an impact here). But if you know what belongs together, you could just use that as the logical unit and create one token entry for multiple tokens produced by your tokenizer – so your token would be 3a instead of 3 and a. Then clicking on 3a would select the whole span. You can store the underlying token information with the JSON, so you'll always be able to reconstruct the true tokenization afterwards (it's one extra postprocessing step but should be easy because it's all generated programmatically – so you know it'll always match).
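A rough sketch of that merging step, for illustration: the function name and the "pieces" key are mine, not Prodigy's API, and the grouping rule used here (merge pieces that touch with no whitespace between them) is just one possible criterion for "what belongs together".

```python
def merge_pieces(pieces, text):
    """Collapse adjacent sub-tokens (e.g. "3" + "a") into one logical token,
    keeping the underlying pieces so the true tokenization can be recovered."""
    merged = []
    for piece in pieces:
        if merged and merged[-1]["end"] == piece["start"]:
            # touches the previous token: extend it
            prev = merged[-1]
            prev["end"] = piece["end"]
            prev["text"] = text[prev["start"]:prev["end"]]
            prev["pieces"].append(dict(piece))  # store original tokenization
        else:
            merged.append({
                "text": text[piece["start"]:piece["end"]],
                "start": piece["start"],
                "end": piece["end"],
                "pieces": [dict(piece)],
            })
    return merged

text = "justin bieber"
pieces = [
    {"text": "justin", "start": 0, "end": 6},
    {"text": "bi", "start": 7, "end": 9},
    {"text": "eber", "start": 9, "end": 13},
]
merged = merge_pieces(pieces, text)
```

Here "bi" and "eber" get merged into a single "bieber" token, while the original two pieces survive under "pieces" for the postprocessing step.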

Hi Ines

What do you mean by token boundaries? My observation is that the boundary is whitespace, not the token: when I double-click, it definitely selects multiple tokens, and the selection seems to extend to the nearest whitespace, or to the end of the preceding span or the start of the following span.

i.e. in the case you describe, if '3' and 'a' are adjacent tokens then it will select '3a' regardless of how I tokenized. But if it's '3' ' ' 'a' (i.e. separated by whitespace), it will select only '3' or 'a'.

Ah, that's strange! Which version of Prodigy and which browser are you using?

If I double-click on "bi" in "bieber" in the example here (Named Entity Recognition · Prodigy · An annotation tool for AI, Machine Learning & NLP), the UI will select only "bi" for me. That's how it should be – the selection should snap to the token boundaries, regardless of whether a token is followed by whitespace or not.

It does for me too – the difference is that "bieber" is surrounded by tokens with "ws": true (i.e. the "justin" and "-" tokens):

    {"text": "justin", "id": 1, "start": 0, "end": 6, "tokenizer_id": 6796, "disabled": false, "ws": true},
    {"text": "bi", "id": 2, "start": 7, "end": 9, "tokenizer_id": 12170, "disabled": false, "ws": false},
    {"text": "eber", "id": 3, "start": 9, "end": 13, "tokenizer_id": 22669, "disabled": false, "ws": true},
    {"text": "-", "id": 4, "start": 14, "end": 15, "tokenizer_id": 1011, "disabled": false, "ws": true},

In my case all of the tokens have "ws": false, but the "delimiters" like "-" have "disabled": true. I need this to render the original string properly and to prevent the user from selecting a "word" that starts or ends with a delimiter (it's fine to select an infix delimiter – I'm identifying high-level entities first, then in a second pass I want to resolve their identity).

I feel like I need a third option: ws, disabled, and word boundary. At the moment, ws seems to be overloaded to represent both whitespace when displaying in the UI and word boundaries when double-clicking.

I've been trying to reproduce this using the example from the first post in the thread, in Firefox.

Interestingly, Chrome doesn't seem to handle native double-clicks in inline elements the same way if the text is rendered with white-space: normal. So I'm assuming this is what causes the behaviour you're seeing.

A quick workaround for now, to give you the double-click back in Chrome, would be to set the following CSS override. It will cause the tokens to be displayed with a bit more space between them visually, though.

    "global_css": ".prodigy-content span[id] { white-space: pre-wrap }"
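For reference, this line is a setting that would go in your configuration file, e.g. a prodigy.json along the lines of the following sketch (assuming no other overrides are needed):

```json
{
  "global_css": ".prodigy-content span[id] { white-space: pre-wrap }"
}
```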