Adding newline and tabs to annotation interface

I'm using a custom Huggingface tokenizer, which I've wrapped in a class that implements __call__(self, text) and returns Doc(self.nlp.vocab, words=words, spaces=spaces).

Generally the Huggingface tokenizers ignore whitespace and control characters. I have managed to create one that doesn't, but I'm not entirely happy with it - I tried Metaspace, which prepends '▁' to words that follow a space and treats control characters as text rather than whitespace.

The issue I'm running into is that I want to concatenate two fields (from a parent and child DB table) with \n\t, but since my tokeniser doesn't return tokens for control characters, they're not being displayed.

i.e. using just the Huggingface pre-tokeniser (my full pipeline has a BPE model following pre-tokenization):

    example = 'AC DRIVE 117001\n\tCURRENT'
    pre_tok = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])
    ws = pre_tok.pre_tokenize_str(example)
    # ws == [('AC', (0, 2)), ('DRIVE', (3, 8)), ('117001', (9, 15)), ('CURRENT', (17, 24))]
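As a sketch of the behaviour I'd want, here is a plain-Python pre-tokeniser that emits the control characters as their own tokens (the regex is illustrative only - this is not the tokenizers library API):

```python
import re

def pre_tokenize(text):
    # words, digit runs, \n / \t, and punctuation each become a token;
    # plain spaces are the only characters dropped
    pattern = r"[A-Za-z]+|\d+|[\n\t]|[^\w\s]"
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(pattern, text)]

print(pre_tokenize('AC DRIVE 117001\n\tCURRENT'))
# [('AC', (0, 2)), ('DRIVE', (3, 8)), ('117001', (9, 15)),
#  ('\n', (15, 16)), ('\t', (16, 17)), ('CURRENT', (17, 24))]
```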

I can use the second Doc constructor to pass the words and spaces, but it appears spaCy expects '\n' and '\t' tokens?

One way is for me to add them myself but I'm wondering if there's a better way of doing this?
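For what it's worth, "adding them myself" can be done generically by diffing the pre-tokeniser offsets against the original text - a pure-Python sketch (the function name is mine):

```python
def words_and_spaces(text, pretokens):
    """Fill the gaps between pre-tokens so every character of `text` is
    covered by either a token or a single trailing-space flag."""
    words, spaces = [], []
    prev_end = 0
    for tok, (start, end) in pretokens + [("", (len(text), len(text)))]:
        gap = text[prev_end:start]
        if gap.startswith(" ") and words:
            spaces[-1] = True          # single separator space -> ws flag
            gap = gap[1:]
        for ch in gap:                 # anything else (\n, \t, ...) -> own token
            words.append(ch)
            spaces.append(False)
        if tok:
            words.append(tok)
            spaces.append(False)
        prev_end = end
    return words, spaces

pretoks = [('AC', (0, 2)), ('DRIVE', (3, 8)), ('117001', (9, 15)), ('CURRENT', (17, 24))]
words, spaces = words_and_spaces('AC DRIVE 117001\n\tCURRENT', pretoks)
# words  == ['AC', 'DRIVE', '117001', '\n', '\t', 'CURRENT']
# spaces == [True, True, False, False, False, False]
```

The resulting lists can then be passed straight to the Doc constructor.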

I have considered trying to replicate my Huggingface tokeniser in spaCy, but I get the impression a spaCy tokeniser is the equivalent of a Huggingface pre-tokeniser, and the BPE happens downstream in spaCy (the word piecer?). I need to pre-tokenise and word-piece before annotation, as my patterns rely on the final tokenisation before vectorisation, not the pre-tokenisation.

Adding to this - what's the "correct" spaCy tokenisation if I want the control characters to display correctly in the annotation interface? I have "text": "AC DRIVE 117001\n\tOUTPUT FREQ".

I modified my tokenisation to include the "\n\t" as non-whitespace, but it only seems to work if I create one token for each, i.e. "\n" + "\t" as below. When I created a single "\n\t" token, the UI displayed the two special characters but only added the newline - it didn't indent the next line.

    0: {'end': 2,  'id': 0, 'start': 0,  'text': 'AC',     'ws': True}
    1: {'end': 8,  'id': 1, 'start': 3,  'text': 'DRIVE',  'ws': True}
    2: {'end': 11, 'id': 2, 'start': 9,  'text': '11',     'ws': False}
    3: {'end': 15, 'id': 3, 'start': 11, 'text': '7001',   'ws': False}
    4: {'end': 16, 'id': 4, 'start': 15, 'text': '\n',     'ws': False, 'disabled': True}
    5: {'end': 17, 'id': 5, 'start': 16, 'text': '\t',     'ws': False, 'disabled': True}
    6: {'end': 23, 'id': 6, 'start': 17, 'text': 'OUTPUT', 'ws': True}
    7: {'end': 28, 'id': 7, 'start': 24, 'text': 'FREQ',   'ws': False}
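For the record, dicts like the ones above can be generated from the words/spaces lists rather than written by hand - a rough sketch (the function name is mine; the fields match what I'm feeding the annotation interface):

```python
def to_annotation_tokens(words, spaces):
    # rebuild start/end offsets from the words/spaces lists; whitespace
    # tokens are marked disabled so they can't be selected in the UI
    tokens, start = [], 0
    for i, (w, ws) in enumerate(zip(words, spaces)):
        tok = {"id": i, "text": w, "start": start, "end": start + len(w), "ws": ws}
        if not w.strip():
            tok["disabled"] = True
        tokens.append(tok)
        start = tok["end"] + (1 if ws else 0)
    return tokens

toks = to_annotation_tokens(
    ['AC', 'DRIVE', '11', '7001', '\n', '\t', 'OUTPUT', 'FREQ'],
    [True, True, False, False, False, False, True, False])
# toks[4] == {'id': 4, 'text': '\n', 'start': 15, 'end': 16, 'ws': False, 'disabled': True}
```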

Hi David,

This issue goes all the way back to why spaCy is called "spaCy". I first saw the need for the library when I wanted to make a language-learning tool that would use POS tags to calculate markup on HTML documents. To do that, I needed to figure out how the tags aligned to the original text... which was really hard with the destructive tokenization used in NLTK etc. My solution was that all characters of the input, including whitespace, should be represented in the tokenized text. Whitespace is handled just like punctuation in spaCy; the only characters that aren't part of the tokens themselves are the single whitespace separators, which are handled as a boolean flag.

To make everything work correctly, you'll need to make sure that the following invariant holds true: text == "".join(word.text_with_ws for word in doc). You'll have problems if the tokens in the Doc contain characters that aren't in the original text. You should also make sure you're calculating the spaces flags correctly. They need to be True only if the token needs to have a following " " in the string. You'll also need to strip out the control characters that the tokenizers often use to indicate that tokens aren't whitespace-separated.
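Concretely, you can check the invariant before you even build the Doc, since text_with_ws is just the token text plus an optional single trailing space - e.g.:

```python
def check_alignment(text, words, spaces):
    # mirrors "".join(token.text_with_ws for token in doc)
    rebuilt = "".join(w + (" " if sp else "") for w, sp in zip(words, spaces))
    return rebuilt == text

text = 'AC DRIVE 117001\n\tCURRENT'
check_alignment(text, ['AC', 'DRIVE', '117001', '\n', '\t', 'CURRENT'],
                [True, True, False, False, False, False])   # True
check_alignment(text, ['AC', 'DRIVE', '117001', 'CURRENT'],
                [True, True, False, False])                  # False: \n\t missing
```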

I could be wrong, but my suspicion is that your tokens somehow don't align to your text correctly, and that's what's causing the problem. In spaCy, "\n\t" would usually actually be one token --- so it's suspicious that that's causing you problems.

By the way, depending on what you're doing, you might find the new version of spacy-transformers that we've developed for spaCy v3 helpful.

The way we're handling the whitespacing in spaCy v3 is to align against the Huggingface tokenization, with each token aligned to zero or more wordpieces. The alignment is stored in a ragged array, so you can always fetch out the indices that correspond to some slice of zero or more spaCy tokens.

Thanks @honnibal - asserting text == "".join(word.text_with_ws for word in doc) certainly helps me ensure everything is aligned; it looks correct for now.

This is my Huggingface tokenizer:

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.normalizers import Lowercase
    from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation
    from tokenizers.decoders import BPEDecoder

    model = BPE(unk_token=unk_token, end_of_word_suffix=suffix)
    tokenizer = Tokenizer(model)
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])
    tokenizer.decoder = BPEDecoder(suffix=suffix)
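For completeness, the tokenizer only becomes usable after training. A self-contained sketch, assuming a BpeTrainer with the same end_of_word_suffix - the unk_token, suffix and tiny corpus here are stand-ins, not my real values:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation
from tokenizers.trainers import BpeTrainer

unk_token, suffix = "[UNK]", "</w>"
tokenizer = Tokenizer(BPE(unk_token=unk_token, end_of_word_suffix=suffix))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])

trainer = BpeTrainer(special_tokens=[unk_token], end_of_word_suffix=suffix)
tokenizer.train_from_iterator(["ac drive 117001 current", "output freq"], trainer)

encoding = tokenizer.encode("AC DRIVE 117001")
print(encoding.tokens)    # wordpieces end with '</w>' at word boundaries
print(encoding.offsets)   # offsets index into the original, pre-normalization text
```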

For now I've dropped the idea of using '\n\t', so this code seems to work:

    encoding = self.tokenizer.encode(text)
    words = []
    spaces = []

    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        word = text[start:end]
        words.append(word)
        spaces.append(token.endswith(suffix) and text[end:].startswith(' '))

    return Doc(self.nlp.vocab, words=words, spaces=spaces)

I'll definitely take a look at spacy-transformers. At the moment I have a Bi-LSTM-CRF model in PyTorch, using the above tokeniser then embedding (the CRF is actually from AllenNLP), and I'm using Prodigy to generate gold labels in order to extract named entities and then build a knowledge base. I plan on looking at a transformer model, but I cannot use pre-trained models as my input doesn't really overlap with the data they're trained on - it's IoT device descriptions from building management systems, containing lots of industry jargon, abbreviated and concatenated (sometimes with delimiters, sometimes without).

So if I could build something end to end in spaCy, that would be nice. Thanks for the heads up!