spaCy Tokenization issue

vaibhav-01 · August 17, 2021, 9:21am

Hi

This is actually a spaCy issue, but I couldn't find a support page for spaCy, so I'm posting it here. Any help would be much appreciated.

I create a doc like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I have $10K")

When I print the tokens in the doc, the output is

["I", "have", "$", "10", "K"]

but I want the following output, ideally achieved with a standard tokenization technique:

["I", "have", "$", "10K"]

Any thoughts on how to achieve this?

ines · August 17, 2021, 11:12pm

Hi! We try to keep this forum very focused on Prodigy – for general usage questions around spaCy, the discussion forum is usually a better place: Discussions · explosion/spaCy · GitHub

Also see the documentation on adding special case rules and customizing the tokenizer rule sets for reference: https://spacy.io/usage/linguistic-features#special-cases
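
As a rough illustration of the special-case route described in that documentation, here is a minimal sketch. It assumes spaCy v3 and uses `spacy.blank("en")` so it runs without downloading a model; the tokenizer rules should be the same English defaults that `en_core_web_sm` uses, but that is an assumption worth checking on your own setup. The `amounts` list is a hypothetical set of strings to protect, not something from the original question.

```python
import spacy
from spacy.symbols import ORTH

# Blank English pipeline: tokenizer only, no model download needed.
nlp = spacy.blank("en")

# Hypothetical list of exact strings that should stay as single tokens.
amounts = ["10K"]

for amount in amounts:
    # A special case maps an exact string to the tokens it should produce.
    # The ORTH values must concatenate back to the original string.
    nlp.tokenizer.add_special_case(amount, [{ORTH: amount}])

doc = nlp("I have $10K")
tokens = [token.text for token in doc]
print(tokens)
```

Special cases only match exact strings, so this protects "10K" but not "25K" or "3M". For a general rule you would likely need to customize the tokenizer's suffix rules instead, which the same documentation page covers.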
