Custom Tokenization Support for spaCy (and by extension Prodigy)

@honnibal,
Creating this as a separate, searchable thread as you directed. My use case isn’t exactly the regular thing you come across: I am trying to mine network and system logs so that the data can be used to identify issues, predict problem points and so on. Logs are natural-language sequences (given that they are written by programmers), but they are semi-structured and require a few different rules. The main constraint I have is custom tokenization, since the normal rules of punctuation mostly do not apply. Logs contain IP addresses, MAC addresses, key-value pairs etc., for which I need to retain punctuation for context.

spaCy provides a way to supply a regex pattern for the tokenizer to apply when parsing, but I have only found examples with a single pattern. I need to support multiple patterns. I currently do this in a hacky way, by modifying tokenizer.pyx so that it accepts TOKEN_MATCH as an iterable rather than a single variable.

I cannot use spaCy’s rule-based matching because enumerating all the possibilities for some of these patterns (such as IPv6 addresses) is well-nigh impossible. I also tried creating a custom tokenizer based on information I found somewhere on the site, but it isn’t working. Here is what I have currently:

import re

from spacy.tokenizer import Tokenizer

# URL_PATTERN, MAC_PATTERN, etc. are regex strings defined elsewhere in my code.
TOKEN_MATCH = [re.compile(URL_PATTERN, re.UNICODE),
               re.compile(MAC_PATTERN, re.UNICODE),
               re.compile(IPV4_PATTERN, re.UNICODE),
               re.compile(IPV6_PATTERN, re.UNICODE),
               re.compile(PROCESS_PATTERN, re.UNICODE),
               re.compile(FILE_PATTERN, re.UNICODE),
               re.compile(HYPHENATED_PATTERN, re.UNICODE),
               re.compile(KEY_VALUE_PATTERN, re.UNICODE)]

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     # Stock spaCy expects a single callable here; passing a
                     # list only works with my modified tokenizer.pyx.
                     token_match=TOKEN_MATCH)

Each of the TOKEN_MATCH entries has its own regex pattern, and I would like the tokenizer to look for these when parsing. I would also like to avoid inconsistent tokenization like the example below, for which I cannot identify the cause. The first MAC address is tokenized correctly, on the word boundary, but the second one is not, for some reason, even though the formats, punctuation and spacing are all consistent. Any help would be greatly appreciated, since it would save me from hacking the code every time I need to run Prodigy. Thanks @ines and @honnibal.

TEXT: Client ZZ:ZZ:ZZ:ZZ:ZZ:ZZ is failed to authenticate, failure count is 1.
TOKENIZATION:

Client client NOUN NN noun, singular or mass compound Xxxxx True False
### MAC address tokenization is correct, at the word boundary ###
ZZ:ZZ:ZZ:ZZ:ZZ:ZZ ZZ:ZZ:ZZ:ZZ:ZZ:ZZ NUM CD cardinal number nsubjpass dd:dd:dx:dd:dd:xd False False
is be VERB VBZ verb, 3rd person singular present auxpass xx True False
failed fail VERB VBN verb, past participle ccomp xxxx True False
to to PART TO infinitival to aux xx True False
authenticate authenticate VERB VB verb, base form xcomp xxxx True False
, , PUNCT , punctuation mark, comma punct , False False
failure failure NOUN NN noun, singular or mass compound xxxx True False
count count NOUN NN noun, singular or mass nsubj xxxx True False
is be VERB VBZ verb, 3rd person singular present ROOT xx True False
1 1 NUM CD cardinal number attr d False False
. . PUNCT . punctuation mark, sentence closer punct . False False

TEXT: Network Login MAC user XXXX logged in MAC AB:CD:EF:GH:IJ:KL port 12 VLAN(s) “xyz”, authentication Radius
TOKENIZATION:

Network network PROPN NNP noun, proper singular compound Xxxxx True False
Login login PROPN NNP noun, proper singular compound Xxxxx True False
MAC mac PROPN NNP noun, proper singular compound XXX True False
user user NOUN NN noun, singular or mass nsubj xxxx True False
XXXX XXXX NUM CD cardinal number appos dddXXXXddXXX False False
logged log VERB VBD verb, past tense ROOT xxxx True False
in in ADP IN conjunction, subordinating or preposition prep xx True False
MAC mac PROPN NNP noun, proper singular pobj XXX True False
### Improper tokenization ###
AB:CD AB:CD NUM CD cardinal number npadvmod dd:dX False False
: : PUNCT : punctuation mark, colon or ellipsis punct : False False
EF ef PROPN NNP noun, proper singular dep XX True False
: : PUNCT : punctuation mark, colon or ellipsis punct : False False
GH:IJ gh:ij NOUN NN noun, singular or mass appos Xd:dX False False
: : PUNCT : punctuation mark, colon or ellipsis punct : False False
KL kl NOUN NN noun, singular or mass compound XX True False
### MAC Address boundary ###
port port NOUN NN noun, singular or mass appos xxxx True False
12 12 NUM CD cardinal number nummod dd False False
VLAN(s vlan(s NOUN NN noun, singular or mass appos XXXX(x False False
) ) PUNCT -RRB- right round bracket punct ) False False
" " PUNCT `` opening quotation mark punct " False False
xyz xyz PROPN NNP noun, proper singular appos XXXX True False
" " PUNCT '' closing quotation mark punct " False False
, , PUNCT , punctuation mark, comma punct , False False
authentication authentication NOUN NN noun, singular or mass compound xxxx True False
Radius radius PROPN NNP noun, proper singular appos Xxxxx True False

Sorry for the delay getting to this — I’ve been travelling.

There are a number of places you could plug in to customize that token match expression. Simplest first:

  1. Can you just smoosh everything into one disjunctive expression? Like, if you have (?:dog|cat), that’ll match dog or cat. Can’t you just do that with all your subexpressions?

  2. If you can’t do it that way, you should be able to pass a custom function into the token_match argument. It can be any callable. You can also assign it after creation, if that’s more convenient, by writing to nlp.tokenizer.token_match.

  3. You can assign a custom object to nlp.tokenizer instead. For basic usage, you should be able to just support the __call__ and pipe methods.

  4. Finally, if you subclass the English class, you can overwrite the make_doc method, so that you use an entirely custom tokenization strategy.
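Options 1 and 2 above can be sketched roughly like this. Note these MAC/IPv4/key-value patterns are simplified stand-ins for illustration, not the actual expressions from the question:

```python
import re

# Simplified stand-ins for the patterns in the question; the real
# MAC/IPv4/etc. expressions would be more elaborate.
MAC_PATTERN = r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"
IPV4_PATTERN = r"(?:\d{1,3}\.){3}\d{1,3}"
KEY_VALUE_PATTERN = r"\w+=\S+"

PATTERNS = (MAC_PATTERN, IPV4_PATTERN, KEY_VALUE_PATTERN)

# Option 1: fold everything into one disjunctive expression.
COMBINED = re.compile("|".join("(?:%s)" % p for p in PATTERNS))

# Option 2: a plain callable that tries each compiled pattern in turn.
_COMPILED = [re.compile(p) for p in PATTERNS]

def multi_token_match(text):
    """Return a match object if any pattern matches at the start, else None."""
    for regex in _COMPILED:
        match = regex.match(text)
        if match:
            return match
    return None

# Either one can then be handed to the tokenizer without touching tokenizer.pyx:
#   nlp.tokenizer.token_match = COMBINED.match
# or
#   nlp.tokenizer.token_match = multi_token_match
```

With option 1, the alternatives are tried left to right, so the more specific patterns should come first in the disjunction.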

For reference, I think this might have already been solved here:

@honnibal,
Thank you for taking the time to clarify these things. @ines helped me a great deal with the tokenization issues, and they seem to be okay for now. I am just detailing my observations and what I have done, to keep a record in case we need to revisit this somewhere down the line.

Also, please find my observations inline.

I will keep you both posted if I run into anything major. And once again, thank you both for your invaluable inputs. :pray: