I'm using a custom Hugging Face tokenizer, which I've wrapped in a class that implements __call__(self, text) and returns Doc(self.nlp.vocab, words=words, spaces=spaces).
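Roughly, the wrapper looks like this (a simplified sketch - the class and attribute names are illustrative and the offsets-to-spaces logic is the gist rather than my exact code):

from spacy.tokens import Doc

class HFTokenizerWrapper:
    def __init__(self, nlp, hf_tokenizer):
        self.nlp = nlp
        self.hf_tokenizer = hf_tokenizer  # a trained tokenizers.Tokenizer

    def __call__(self, text):
        encoding = self.hf_tokenizer.encode(text)
        words = encoding.tokens
        # A token gets a trailing space if the next token does not start
        # immediately after it; the last token never gets one.
        spaces = [
            encoding.offsets[i + 1][0] > encoding.offsets[i][1]
            for i in range(len(words) - 1)
        ]
        if words:
            spaces.append(False)
        return Doc(self.nlp.vocab, words=words, spaces=spaces)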
Generally the Hugging Face tokenizers ignore whitespace and control characters. I have managed to create one which doesn't, but I'm not entirely happy with it - I tried Metaspace, which prepends '▁' to words preceded by a space and treats control characters as text rather than whitespace.
The issue I'm running into is that I want to concatenate two fields (from a parent and a child db table) with '\n\t', but since my tokeniser doesn't return tokens for the control characters, they don't come through in the resulting Doc.
i.e. using just the Hugging Face pre-tokeniser (my full pipeline has a BPE model following pre-tokenization):
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation

example = 'AC DRIVE 117001\n\tCURRENT'
pre_tok = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])
ws = pre_tok.pre_tokenize_str(example)
print(ws)
[('AC', (0, 2)), ('DRIVE', (3, 8)), ('117001', (9, 15)), ('CURRENT', (17, 24))]
I can use the second Doc constructor to pass the words and spaces, but it appears spaCy expects '\n' and '\t' to be tokens in their own right?
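To make the problem concrete (an illustrative snippet rather than my actual pipeline code): the spaces flag can only ever re-insert a single space, so the '\n\t' between the two fields is lost.

from spacy.vocab import Vocab
from spacy.tokens import Doc

words = ['AC', 'DRIVE', '117001', 'CURRENT']
spaces = [True, True, False, False]
doc = Doc(Vocab(), words=words, spaces=spaces)
print(repr(doc.text))
# 'AC DRIVE 117001CURRENT'  <- the '\n\t' between the fields is gone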
One option is to add the whitespace tokens myself (see the sketch below), but I'm wondering if there's a better way of doing this?
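For illustration, this is roughly what I mean by adding them myself - walking the pre-tokenizer offsets and emitting any skipped-over text as explicit tokens (a minimal sketch; doc_with_whitespace and the single-space handling are just how I'd sketch it, not settled code):

from spacy.vocab import Vocab
from spacy.tokens import Doc
from tokenizers.pre_tokenizers import Sequence, Whitespace, Digits, Punctuation

def doc_with_whitespace(vocab, text, pre_tok):
    """Build a Doc from a HF pre-tokenizer, re-inserting skipped whitespace.

    A single space between tokens becomes spaces=True on the previous token;
    anything else (e.g. a newline plus tab) is kept as its own whitespace token.
    """
    pieces = pre_tok.pre_tokenize_str(text)
    words, spaces = [], []
    prev_end = 0
    for word, (start, end) in pieces:
        gap = text[prev_end:start]
        if gap == " " and spaces:
            spaces[-1] = True          # plain single space: use the spaces flag
        elif gap:
            words.append(gap)          # '\n\t' etc.: keep as an explicit token
            spaces.append(False)
        words.append(word)
        spaces.append(False)
        prev_end = end
    return Doc(vocab, words=words, spaces=spaces)

pre_tok = Sequence([Whitespace(), Digits(individual_digits=False), Punctuation()])
doc = doc_with_whitespace(Vocab(), 'AC DRIVE 117001\n\tCURRENT', pre_tok)
print([t.text for t in doc])
# ['AC', 'DRIVE', '117001', '\n\t', 'CURRENT']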
I have considered trying to replicate my Hugging Face tokeniser in spaCy, but I get the impression that a spaCy tokeniser is the equivalent of a Hugging Face pre-tokeniser, with the BPE happening downstream in spaCy (the word piecer?). I need to pre-tokenise and word-piece before annotation, as my patterns rely on the final tokenization before vectorisation, not on the pre-tokenisation.