How to tell spaCy not to split any intra-hyphen words?

The solutions I have found so far introduce other unwanted changes to the tokenization rules. For example, some of them no longer split words like “can’t” correctly, and others split the text into sentences at every dot. The ideal solution would keep every intra-hyphen word together without changing any other splitting behavior.

Here is an example of one of the problematic solutions available:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(u'Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
[token.text for token in doc]

Output:

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']

Question: How does r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''' solve the problem of intra-hyphen words when there is no hyphen in it?

The above solution creates, for example, the following problem (incorrect tokenization of “can’t”);

doc = nlp("This can't be it.")
print([token.text for token in doc])
print([(token.text, token.tag_) for token in doc])

Output:

['This', 'can', "'", 't', 'be', 'it', '.']
[('This', 'DT'), ('can', 'MD'), ("'", '``'), ('t', 'NN'), ('be', 'VB'), ('it', 'PRP'), ('.', '.')]

Expected output:

['This', 'ca', "n't", 'be', 'it', '.']
[('This', 'DT'), ('ca', 'MD'), ("n't", 'RB'), ('be', 'VB'), ('it', 'PRP'), ('.', '.')]

The reason your custom tokenizer no longer handles exceptions like "can't" is that you’re not passing in any rules when you create it. The rules argument holds the tokenizer exceptions that define special cases, like contractions in English. Also see the Tokenizer API for the possible arguments.
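As a rough sketch of what passing in the rules could look like, the tokenizer from the question can be rebuilt with the rules argument set to the language's default exceptions. This assumes a recent spaCy version in which spacy.blank("en") and nlp.Defaults.tokenizer_exceptions are available; the infix pattern is copied from the question unchanged.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Infix pattern copied from the question; note that it contains no hyphen.
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(
        nlp.vocab,
        # Pass the default exceptions so contractions like "can't" stay special cases.
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
    )

# A blank English pipeline is assumed here so that no trained model is needed.
nlp = spacy.blank("en")
nlp.tokenizer = custom_tokenizer(nlp)

tokens = [t.text for t in nlp("This can't be it.")]
hyphen_tokens = [t.text for t in nlp("It is a male-dominated profession.")]
print(tokens)
print(hyphen_tokens)
```

Special cases are checked before the infix patterns are applied, which is why the apostrophe in the infix pattern should no longer break "can't" apart.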

The infix rules determine how to split inside a token. If they include a hyphen, spaCy will always split intra-hyphen words. If they only match a hyphen between two capital letters, spaCy will only split hyphens between capital letters. If they include no hyphens, spaCy won’t split on hyphens at all.
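To make the three cases concrete, here is a small sketch that uses only Python's re module. It does not call spaCy; it just shows which positions an infix pattern's finditer would report as split points. The three patterns and the helper name split_points are made up for illustration.

```python
import re

# Three hypothetical infix patterns, one for each case described above.
always = re.compile(r"-")                        # matches any hyphen
caps_only = re.compile(r"(?<=[A-Z])-(?=[A-Z])")  # hyphen between capital letters only
no_hyphen = re.compile(r"[~:]")                  # contains no hyphen at all

def split_points(pattern, text):
    """Return the character offsets at which the pattern would split the text."""
    return [match.start() for match in pattern.finditer(text)]

for word in ("male-dominated", "X-Y"):
    print(word,
          split_points(always, word),
          split_points(caps_only, word),
          split_points(no_hyphen, word))
```

An empty list means the pattern gives the tokenizer no reason to split inside that word.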

If you only want to make small modifications to your rules like that, you might find it easier to add to an existing rule set instead of creating an entirely new tokenizer from scratch.

Can I just take out the rule to split intra-hyphen words from the set of infix rules?

for infix_ in nlp.Defaults.infixes:
    if '-' in infix_:
        print (infix_)

Output:

(?<=[0-9])[+\-\*^](?=[0-9-])
(?<=[[[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])\.(?=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])
(?<=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]]),(?=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])
(?<=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])[?";:=,.]*(?:-|–|—|--|---|——|~)(?=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])
(?<=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]"])[:<>=/](?=[[[\p{Lu}&&\p{Latin}]||[ЁА-Я]||[ӘӨҮҖҢҺ]||[Α-ΩΆΈΊΌΏΉΎ]||[\p{Ll}&&\p{Latin}]||[ёа-я]||[әөүҗңһ]||[α-ωάέίόώήύ]||[\p{L}&&\p{Bengali}]||[\p{L}&&\p{Hebrew}]||[\p{L}&&\p{Arabic}]||[\p{L}&&\p{Sinhala}]]])

How can I do that then? I can’t find any specific portion from the infixes that I need to remove.

If I add anything to the set of suffix rules, there is gonna be mistagging somewhere. I therefore think it would be great if I could “just take out the rule to split intra-hyphen words from the set of infix rules”.

For example;

suffixes = nlp.Defaults.suffixes + (r'''\w+-\w+''',)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

The above code solves the problem of intra-hyphen words. However, I fear, this might have caused issues somewhere. In order to verify, I’ll test it. But I would like to know if there was a better way. ^^

Yeah, the thing with tokenization is that there are always trade-offs. The default rules try to optimise for the best possible compromise of performance and accuracy – but there’ll always be cases where it over-segments or under-segments. Rule-based tokenization (like spaCy’s) can make it more difficult to handle context-specific cases, but statistical tokenization is often much slower and instead makes mistakes in different places.

The rule you’re looking for is the fourth one in the output you posted, the one containing (?:-|–|—|--|---|——|~). It takes care of splitting various hyphens and punctuation characters if they occur between letters. There are no separate rules that each handle a single character, because that wouldn’t be very efficient. So there’s also no single option to “take out character X”.
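One way to drop that rule without rebuilding the tokenizer might look like the sketch below. It assumes a recent spaCy version and relies on HYPHENS from spacy.lang.char_classes appearing verbatim inside the default hyphen infix rule, which may not hold in every version, so it is worth printing the filtered list to check. It removes the whole rule, so dashes and ~ between letters are no longer split either.

```python
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# A blank English pipeline is assumed so that no trained model is needed.
nlp = spacy.blank("en")

# Keep every default infix rule except the one built from the HYPHENS alternation.
infixes = [rule for rule in nlp.Defaults.infixes if HYPHENS not in rule]
infix_re = compile_infix_regex(infixes)

# Only the infix matcher is swapped; prefixes, suffixes and exceptions stay as they are.
nlp.tokenizer.infix_finditer = infix_re.finditer

contraction_tokens = [t.text for t in nlp("This can't be it.")]
hyphen_tokens = [t.text for t in nlp("It is a male-dominated profession.")]
print(contraction_tokens)
print(hyphen_tokens)
```

Because the existing tokenizer is modified in place, the exceptions for contractions are left untouched.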

If there are specific cases that you know will always have to be tokenized a certain way, you could also add a custom pipeline component that merges or splits them afterwards.
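A merging component along those lines could be sketched as follows. This assumes spaCy v3, where components are registered with @Language.component; the component name merge_hyphenated and its merging rule (tokens joined by "-" with no whitespace around it) are illustrative choices, not something built into spaCy.

```python
import spacy
from spacy.language import Language

@Language.component("merge_hyphenated")
def merge_hyphenated(doc):
    spans = []
    i = 0
    while i < len(doc):
        end = i
        # Extend over "word", "-", "word" runs with no whitespace around the hyphen.
        while (
            end + 2 < len(doc)
            and doc[end + 1].text == "-"
            and not doc[end].whitespace_
            and not doc[end + 1].whitespace_
        ):
            end += 2
        if end > i:
            spans.append(doc[i : end + 1])
        i = end + 1
    # Merge all collected spans in one retokenization pass.
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("merge_hyphenated")

doc = nlp("My mother-in-law works in a male-dominated profession.")
tokens = [t.text for t in doc]
print(tokens)
```

The default tokenization rules stay untouched with this approach, so anything that runs before the component still sees the split tokens.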

Hi

I am very new to spaCy/Prodigy, so apologies if I am missing something obvious, but I am unsure how to use a custom tokeniser in Prodigy. I also need to ensure my text is not split up on hyphens, and a custom tokeniser similar to the one above works fine for me in spaCy. When I try to annotate my text in Prodigy, most of the entities I want to label get split into multiple tokens because of embedded hyphens. How can I use a custom tokeniser in Prodigy (ner, train), or do I have to tokenise my text in spaCy and then import it into Prodigy?

thanks in advance
Tushar

If you’re modifying the tokenization rules in spaCy and then saving the nlp object with nlp.to_disk, those rules will be saved with the model and loaded back in when you load the model. So once you’ve created your nlp object with custom tokenization rules, you could call nlp.to_disk('/path/to/model'). In Prodigy, you could then use /path/to/model as the base model, instead of en_core_web_sm or whichever model you’re using.
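A minimal sketch of that round trip, assuming a recent spaCy version, is shown below. A temporary directory stands in for /path/to/model, and the hyphen rule is filtered out of the default infixes purely as an example customization.

```python
import tempfile
import spacy
from spacy.lang.char_classes import HYPHENS
from spacy.util import compile_infix_regex

# Example customization: drop the default infix rule that splits on hyphens.
nlp = spacy.blank("en")
infixes = [rule for rule in nlp.Defaults.infixes if HYPHENS not in rule]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

with tempfile.TemporaryDirectory() as model_dir:
    # Save the pipeline, including the modified tokenizer rules.
    nlp.to_disk(model_dir)
    # Loading from the directory should restore the same tokenization rules.
    reloaded = spacy.load(model_dir)
    tokens = [t.text for t in reloaded("It is a male-dominated profession.")]

print(tokens)
```

The saved directory is what you would then pass to Prodigy in place of the model name.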