Tokens from 'Tokenizer' are different from 'en' model

I tokenized the same text in two different ways: once with the Tokenizer class, and once by taking the tokens from the processed document. The tokens are different. Are the two using different tokenization algorithms? I was trying to automatically create the JSONL that can be consumed by Prodigy, and I was getting the character offsets wrong.

text = Pressured up to 35 bar on B-annulus. Pressured up to 175 bar with MP2 using packer fluid on csg side. Observed 17 bar pressure increase on B-annulus due to ballooning (35-52 bar) Attempted to pressure up B-annulus to 220 bar with cmt unit. Stopped pumping at 97 bar and only 16 ltrs pumped (theoretical volume to 97 bar = 280 ltr).

import spacy
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(spacy.load('en').vocab)
tokens = [str(tok).lower() for tok in tokenizer(str(text))]

Output: ['Pressured', 'up', 'to', '35', 'bar', 'on', 'B-annulus.', 'Pressured', 'up', 'to', '175', 'bar', 'with', 'MP2', 'using', 'packer', 'fluid', 'on', 'csg', 'side.', 'Observed', '17', 'bar', 'pressure', 'increase', 'on', 'B-annulus', 'due', 'to', 'ballooning', '(35-52', 'bar)', ' ', 'Attempted', 'to', 'pressure', 'up', 'B-annulus', 'to', '220', 'bar', 'with', 'cmt', 'unit.', 'Stopped', 'pumping', 'at', '97', 'bar', 'and', 'only', '16', 'ltrs', 'pumped', '(theoretical', 'volume', 'to', '97', 'bar', '=', '280', 'ltr).']

nlp = spacy.load('en')
doc = nlp(str(text))
tokens = [tok.text.lower() for tok in doc]
Output: ['Pressured', 'up', 'to', '35', 'bar', 'on', 'B', '-', 'annulus', '.', 'Pressured', 'up', 'to', '175', 'bar', 'with', 'MP2', 'using', 'packer', 'fluid', 'on', 'csg', 'side', '.', 'Observed', '17', 'bar', 'pressure', 'increase', 'on', 'B', '-', 'annulus', 'due', 'to', 'ballooning', '(', '35', '-', '52', 'bar', ')', ' ', 'Attempted', 'to', 'pressure', 'up', 'B', '-', 'annulus', 'to', '220', 'bar', 'with', 'cmt', 'unit', '.', 'Stopped', 'pumping', 'at', '97', 'bar', 'and', 'only', '16', 'ltrs', 'pumped', '(', 'theoretical', 'volume', 'to', '97', 'bar', '=', '280', 'ltr', ')', '.']

For the same text, the two token lists are different. Why does this happen?

When I use the Tokenizer tokens to create the JSONL for Prodigy, the learning fails with 0.
When I use the tokens taken from the spaCy nlp doc, learning is successful.

I suspect the problem is the start and end character offsets. I am assuming that the 'en' model uses a particular tokenizer that differs from the plain Tokenizer class. Is there a way to call just the tokenizer that the 'en' model uses? My current way of deriving tokens takes rather long for large texts.


What you're observing here makes sense – under the hood, the tokenizer is powered by the Tokenizer class. But when you initialize the blank Tokenizer with only a vocabulary and no rules, it will really only split on whitespace. The English tokenizer (and other language-specific tokenizers) include a lot of additional punctuation rules and exceptions for things like abbreviations etc. So it's expected that they produce different output. You can find more details on this in the tokenization docs.
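To make the difference concrete, here is a minimal sketch that runs both tokenizers on one sentence from the question. It assumes spaCy is installed and uses the blank English class, so no model download is needed; the variable names are only illustrative.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

# Blank English pipeline: language-specific tokenizer rules, no statistical model.
nlp = English()

text = "Pressured up to 35 bar on B-annulus."

# A Tokenizer built from only a vocab has no prefix, suffix or infix rules,
# so it can only split on whitespace.
bare_tokenizer = Tokenizer(nlp.vocab)
bare_tokens = [tok.text for tok in bare_tokenizer(text)]

# nlp.tokenizer is the same class, but configured with the English rules.
english_tokens = [tok.text for tok in nlp.tokenizer(text)]

print(bare_tokens)
print(english_tokens)
```

The first list should keep 'B-annulus.' as a single token, while the second should split it into 'B', '-', 'annulus' and '.', which mirrors the two outputs in the question.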

If you only want to tokenize, you definitely don't want to load and run the whole model and assign all the other linguistic annotations. You can either use the English class directly, call nlp.tokenizer / nlp.make_doc, or use the nlp.disable_pipes context manager if you already have the model loaded and just want to temporarily disable the other components.
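As a rough sketch of the tokenizer-only route applied to the original problem, the snippet below builds one JSONL line with token character offsets taken from the doc itself. The helper name to_prodigy_task is hypothetical, and the token fields (text, start, end, id) are my assumption about what Prodigy expects, so check them against the Prodigy docs for your version.

```python
import json

from spacy.lang.en import English

# Blank English pipeline: only the tokenizer runs, so this stays fast.
nlp = English()


def to_prodigy_task(text):
    """Hypothetical helper: tokenize only and record character offsets."""
    doc = nlp.make_doc(text)  # runs the tokenizer and nothing else
    tokens = [
        {
            "text": tok.text,
            "start": tok.idx,                # offset of the first character
            "end": tok.idx + len(tok.text),  # offset just past the last character
            "id": tok.i,                     # index of the token in the doc
        }
        for tok in doc
    ]
    return {"text": text, "tokens": tokens}


task = to_prodigy_task("Stopped pumping at 97 bar (theoretical volume = 280 ltr).")
print(json.dumps(task))  # one line of the JSONL file
```

Because the offsets come from tok.idx rather than from re-joining token strings, they stay aligned with the original text even where the tokenizer splits off punctuation or the text contains repeated spaces.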

I actually just posted some examples of this on Twitter, see here:


Thank you @ines
This improves the running time a lot.
