I used tokenization in two different places: once via the Tokenizer class, and once by taking the tokens from the Doc returned by the pipeline. The tokens come out different. Are the two using different tokenization algorithms? I was trying to automatically create the JSONL that Prodigy consumes, and I was getting the character offsets wrong.
text = Pressured up to 35 bar on B-annulus. Pressured up to 175 bar with MP2 using packer fluid on csg side. Observed 17 bar pressure increase on B-annulus due to ballooning (35-52 bar) Attempted to pressure up B-annulus to 220 bar with cmt unit. Stopped pumping at 97 bar and only 16 ltrs pumped (theoretical volume to 97 bar = 280 ltr).
Tokenizer:
import spacy
from spacy.tokenizer import Tokenizer
tokenizer = Tokenizer(spacy.load('en').vocab)
tokens = [str(tok).lower() for tok in tokenizer(str(text))]
Output: ['Pressured', 'up', 'to', '35', 'bar', 'on', 'B-annulus.', 'Pressured', 'up', 'to', '175', 'bar', 'with', 'MP2', 'using', 'packer', 'fluid', 'on', 'csg', 'side.', 'Observed', '17', 'bar', 'pressure', 'increase', 'on', 'B-annulus', 'due', 'to', 'ballooning', '(35-52', 'bar)', ' ', 'Attempted', 'to', 'pressure', 'up', 'B-annulus', 'to', '220', 'bar', 'with', 'cmt', 'unit.', 'Stopped', 'pumping', 'at', '97', 'bar', 'and', 'only', '16', 'ltrs', 'pumped', '(theoretical', 'volume', 'to', '97', 'bar', '=', '280', 'ltr).']
spaCy nlp:
nlp = spacy.load('en')
doc = nlp(str(text))
tokens = [tok.text.lower() for tok in doc]
Output: ['Pressured', 'up', 'to', '35', 'bar', 'on', 'B', '-', 'annulus', '.', 'Pressured', 'up', 'to', '175', 'bar', 'with', 'MP2', 'using', 'packer', 'fluid', 'on', 'csg', 'side', '.', 'Observed', '17', 'bar', 'pressure', 'increase', 'on', 'B', '-', 'annulus', 'due', 'to', 'ballooning', '(', '35', '-', '52', 'bar', ')', ' ', 'Attempted', 'to', 'pressure', 'up', 'B', '-', 'annulus', 'to', '220', 'bar', 'with', 'cmt', 'unit', '.', 'Stopped', 'pumping', 'at', '97', 'bar', 'and', 'only', '16', 'ltrs', 'pumped', '(', 'theoretical', 'volume', 'to', '97', 'bar', '=', '280', 'ltr', ')', '.']
For the same text, the two token lists are different. Why does this happen?
When I use the Tokenizer tokens to create the JSONL for Prodigy, the learning fails with 0.
When I use the tokens created from the spaCy nlp Doc, the learning is successful.
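For context, this is roughly how I build each JSONL line from a Doc. The token field names ("text", "start", "end", "id") are my reading of the Prodigy token format, so please treat them as an assumption:

import json

def doc_to_prodigy_line(doc):
    # Hypothetical helper: collect each token's text plus character offsets
    # taken directly from the Doc, then dump one JSONL line.
    tokens = [
        {
            "text": tok.text,
            "start": tok.idx,                # char offset where the token starts
            "end": tok.idx + len(tok.text),  # char offset where the token ends
            "id": i,
        }
        for i, tok in enumerate(doc)
    ]
    return json.dumps({"text": doc.text, "tokens": tokens})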
I suspect the problem is the start/end character offsets. I assume the 'en' model uses a particular tokenizer that is different from the bare Tokenizer class. Is there a way to call just the tokenizer that the 'en' model uses? The current way of deriving tokens (running the full nlp pipeline) takes quite a long time on large texts.
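For example, is something like this the intended way to get only the 'en' tokenization? (nlp.tokenizer / nlp.make_doc are my guesses at the right calls; I have not verified that they are faster.)

doc = nlp.tokenizer(str(text))   # or: nlp.make_doc(str(text)), tokenization only, no tagger/parser/NER
tokens = [tok.text.lower() for tok in doc]
offsets = [(tok.idx, tok.idx + len(tok.text)) for tok in doc]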
Arul.