Hi,
I would like to customize the tokenizer of the spaCy English model.
I have searched on Stack Overflow but still could not solve it.
Certain tokens are split incorrectly for my use case.
For example:
UPDATE 2-Buffett's Berkshire cuts crisis-era Goldman stake
tokenized into:
['UPDATE', '2-Buffett', "'s", 'Berkshire', 'cuts', 'crisis', '-', 'era', 'Goldman', 'stake']
but what I expect is:
['UPDATE', '2', '-', 'Buffett', "'s", 'Berkshire', 'cuts', 'crisis', '-', 'era', 'Goldman', 'stake']
The issue is that a hyphen between a digit and text (2-Buffett) is not split.
Another example:
America's No. 1 Recruitment Site
tokenized into:
['America', "'s", 'No', '.', '1', 'Recruitment', 'Site']
but what I expect is:
['America', "'s", 'No.', '1', 'Recruitment', 'Site']
The issue is that No and . (dot) should be kept together as the single token No.
Could I have a code snippet for this, please? Thanks.
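For reference, here is roughly the kind of snippet I am after, sketched with spaCy's tokenizer customization APIs (compile_infix_regex for an extra infix rule, add_special_case for the "No." exception). It uses a blank English pipeline for brevity; I assume the same changes would apply to a loaded model like en_core_web_sm:

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

# Blank English pipeline; shares the English tokenizer rules.
nlp = spacy.blank("en")

# Extra infix rule: split a hyphen that sits between a digit and a letter.
# The default English rule only splits hyphens between two letters,
# which is why "crisis-era" splits but "2-Buffett" does not.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])-(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Special case: keep "No." together as one token instead of
# splitting the trailing period off as a suffix.
nlp.tokenizer.add_special_case("No.", [{ORTH: "No."}])

print([t.text for t in nlp("UPDATE 2-Buffett's Berkshire cuts crisis-era Goldman stake")])
print([t.text for t in nlp("America's No. 1 Recruitment Site")])
```

One caveat I am aware of: the "No." special case would also keep a sentence-final "No." (as in a negative answer) glued together, so I am open to a more targeted approach if there is one.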