The solutions to this problem that are available so far introduce other unwanted changes to the tokenization rules. For example, some fail to split contractions such as “can’t”, while others split text into sentences at every period. The ideal solution I am looking for would keep intra-hyphen words intact without causing any other changes to the tokenization rules whatsoever.
Here is an example of one such problematic solution:
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # The infixes are reduced to this character class of punctuation and
    # quote marks; note that it contains no hyphen.
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut is deprecated in spaCy v3
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]
Output:
['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
Question: How does r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''' solve the problem of intra-hyphen words when there is no hyphen in there?
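As far as I can tell, the custom pattern replaces spaCy’s default infix rules entirely; since it contains no hyphen, no infix match is ever found inside a hyphenated word. A plain-`re` check, using only the pattern quoted above, seems to confirm this:

```python
import re

# The custom infix pattern quoted above: a character class of punctuation
# and quote marks, with no hyphen in it.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

# No match inside a hyphenated word, so nothing tells the tokenizer
# to split "male-dominated" at the hyphen.
print(infix_re.findall('male-dominated'))  # []

# Ordinary punctuation is still matched as an infix.
print(infix_re.findall('profession;'))  # [';']
```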
The above solution also introduces a new problem, for example the incorrect tokenization of “can’t”:
doc = nlp("This can't be it.")
print([token.text for token in doc])
print([(token.text, token.tag_) for token in doc])
Output:
['This', 'can', "'", 't', 'be', 'it', '.']
[('This', 'DT'), ('can', 'MD'), ("'", '``'), ('t', 'NN'), ('be', 'VB'), ('it', 'PRP'), ('.', '.')]
Expected output:
['This', 'ca', "n't", 'be', 'it', '.']
[('This', 'DT'), ('ca', 'MD'), ("n't", 'RB'), ('be', 'VB'), ('it', 'PRP'), ('.', '.')]
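For completeness, the closest I have come to a workaround is to hand the default tokenizer exceptions back to the custom `Tokenizer` via its `rules` argument. The following is only a sketch, assuming spaCy v3 (`spacy.blank('en')` is used here just to avoid loading a trained model), and I don’t know whether it introduces other side effects:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Same hyphen-free infix pattern as above.
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab,
                     # Restoring the default exceptions keeps special cases
                     # such as "can't" -> "ca", "n't".
                     rules=nlp.Defaults.tokenizer_exceptions,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

nlp = spacy.blank('en')  # blank pipeline: tokenizer only, no model download needed
nlp.tokenizer = custom_tokenizer(nlp)
print([t.text for t in nlp("This can't be a male-dominated profession.")])
```

With this, “male-dominated” stays whole while the contraction is split as spaCy normally would, but it is unclear to me whether this restores all of the default behavior.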