I'm using Prodigy for NER on some slightly unusual strings. There's (usually) no whitespace; instead, punctuation separates the tokens. I've created a custom recipe with a Hugging Face tokenizer that uses BertPreTokenizer + Digits, then BPE (based on https://github.com/explosion/prodigy-recipes/blob/master/other/transformers_tokenizers.py).
I've modified it slightly: I don't want the user to be able to select separators like "-" or "_" as the first or last token of an entity span. But I also don't want the UI to treat them as whitespace, because then the text doesn't display correctly. For example, "fcu-3a-2-rm-temp" is displayed as "fcu - 3 a - 2 - rm - temp".
So in my recipe I set disabled=True for all non-text characters and ws=True only for actual spaces.
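To make that concrete, here is a minimal, self-contained sketch of the flag logic. It is not my actual recipe: `simple_offsets` is a hypothetical regex stand-in for the Hugging Face tokenizer (runs of letters, single digits, single punctuation characters), and `build_tokens` is an illustrative helper name. Only the "disabled" and "ws" rules matter here.

```python
import re


def simple_offsets(text):
    # Hypothetical stand-in for the real tokenizer: letter runs,
    # single digits, and single punctuation characters (incl. "_").
    return [m.span() for m in re.finditer(r"[^\W\d_]+|\d|[^\w\s]|_", text)]


def build_tokens(text, offsets):
    tokens = []
    for i, (start, end) in enumerate(offsets):
        piece = text[start:end]
        tokens.append({
            "text": piece,
            "id": i,
            "start": start,
            "end": end,
            # Separators (no letters or digits) can't be selected.
            "disabled": not any(ch.isalnum() for ch in piece),
            # ws is True only if a real whitespace character follows.
            "ws": text[end:end + 1].isspace(),
        })
    return tokens


text = "fcu-3a-2-rm-temp next"
for token in build_tokens(text, simple_offsets(text)):
    print(token)
```

With these flags the separators stay visible without extra spacing, but they can't start or end a span.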
This works pretty well, except for double-clicking, which seems to use "ws" as the boundary. Is there a way to make the UI's double-click select the sequence of tokens between the preceding and following "disabled" tokens, rather than between "ws" boundaries?
For reference, these are the tokens produced for that example:
{'text': 'fcu', 'id': 7, 'start': 7, 'end': 10, 'tokenizer_id': 126, 'disabled': False, 'ws': False}
{'text': '-', 'id': 8, 'start': 10, 'end': 11, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
{'text': '3', 'id': 9, 'start': 11, 'end': 12, 'tokenizer_id': 46, 'disabled': False, 'ws': False}
{'text': 'a', 'id': 10, 'start': 12, 'end': 13, 'tokenizer_id': 66, 'disabled': False, 'ws': False}
{'text': '-', 'id': 11, 'start': 13, 'end': 14, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
{'text': '2', 'id': 12, 'start': 14, 'end': 15, 'tokenizer_id': 45, 'disabled': False, 'ws': False}
{'text': '-', 'id': 13, 'start': 15, 'end': 16, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
{'text': 'rm', 'id': 14, 'start': 16, 'end': 18, 'tokenizer_id': 162, 'disabled': False, 'ws': False}
{'text': '-', 'id': 15, 'start': 18, 'end': 19, 'tokenizer_id': 40, 'disabled': True, 'ws': False}
{'text': 'temp', 'id': 16, 'start': 19, 'end': 23, 'tokenizer_id': 139, 'disabled': False, 'ws': True}
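To be clear about the behaviour I'm after, here is a rough sketch of the selection rule in Python. This is not Prodigy's API or how the UI works today; `expand_selection` is a hypothetical name, and treating whitespace as a boundary in addition to "disabled" tokens is my own assumption.

```python
def expand_selection(tokens, clicked):
    """Return (first, last) token indices to select, or None."""
    # Double-clicking a separator should select nothing.
    if tokens[clicked]["disabled"]:
        return None
    first = clicked
    # Grow left until a disabled token or a whitespace gap.
    while (first > 0
           and not tokens[first - 1]["disabled"]
           and not tokens[first - 1]["ws"]):
        first -= 1
    last = clicked
    # Grow right until a whitespace gap or a disabled token.
    while (last < len(tokens) - 1
           and not tokens[last]["ws"]
           and not tokens[last + 1]["disabled"]):
        last += 1
    return first, last


# Trimmed-down version of the first few tokens listed above.
tokens = [
    {"text": "fcu", "disabled": False, "ws": False},
    {"text": "-", "disabled": True, "ws": False},
    {"text": "3", "disabled": False, "ws": False},
    {"text": "a", "disabled": False, "ws": False},
    {"text": "-", "disabled": True, "ws": False},
    {"text": "2", "disabled": False, "ws": False},
]
print(expand_selection(tokens, 2))
```

So double-clicking either "3" or "a" would select "3a", and double-clicking a "-" would select nothing.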