Hi,
I would like to customize the tokenizer of the spaCy English model.
I have searched on Stack Overflow but still could not solve it.
Certain tokens are split incorrectly for my use case.
For example:
UPDATE 2-Buffett's Berkshire cuts crisis-era Goldman stake
tokenized into:
['UPDATE', '2-Buffett', "'s", 'Berkshire', 'cuts', 'crisis', '-', 'era', 'Goldman', 'stake']
but what I expect is:
['UPDATE', '2', '-', 'Buffett', "'s", 'Berkshire', 'cuts', 'crisis', '-', 'era', 'Goldman', 'stake']
The issue is that a hyphen between a digit and text (2-Buffett) is not split.
Another example:
America's No. 1 Recruitment Site
tokenized into:
['America', "'s", 'No', '.', '1', 'Recruitment', 'Site']
but what I expect is:
['America', "'s", 'No.', '1', 'Recruitment', 'Site']
The issue is that No and . (dot) should be kept together as the single token No.
Could I have a code snippet for this, please? Thanks.
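For reference, here is roughly the kind of snippet I am after, sketched with spaCy's tokenizer customization APIs (compile_infix_regex for an extra infix rule, add_special_case for the "No." exception). It uses a blank English pipeline for brevity; I assume the same changes would apply to a loaded model like en_core_web_sm:

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

# Blank English pipeline; shares the English tokenizer rules.
nlp = spacy.blank("en")

# Extra infix rule: split a hyphen that sits between a digit and a letter.
# The default English rule only splits hyphens between two letters,
# which is why "crisis-era" splits but "2-Buffett" does not.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[0-9])-(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Special case: keep "No." together as one token instead of
# splitting the trailing period off as a suffix.
nlp.tokenizer.add_special_case("No.", [{ORTH: "No."}])

print([t.text for t in nlp("UPDATE 2-Buffett's Berkshire cuts crisis-era Goldman stake")])
print([t.text for t in nlp("America's No. 1 Recruitment Site")])
```

One caveat I am aware of: the "No." special case would also keep a sentence-final "No." (as in a negative answer) glued together, so I am open to a more targeted approach if there is one.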