My use case differs a little from typical NLP applications. I am trying to use NLP to analyze, understand and potentially summarize log files from networking devices, so that it can help bring down troubleshooting times. Because of this, my tokenization, NER and POS requirements are different.
Some backstory that I wrote up when I may have noticed something weird in spaCy: https://github.com/explosion/spaCy/issues/2412. The link should explain a bit about what I am trying to do.
Network Login MAC user 68B599A71D20 logged in MAC 68:B5:99:A7:1D:20 port 20 VLANs EDLAB, authentication Radius
Custom tokenization (plus spaCy output):
================================================================
Network            network            PROPN  NNP  noun, proper singular                      compound  Xxxxx              True   False
Login              login              PROPN  NNP  noun, proper singular                      compound  Xxxxx              True   False
MAC                mac                PROPN  NNP  noun, proper singular                      compound  XXX                True   False
user               user               NOUN   NN   noun, singular or mass                     compound  xxxx               True   False
admin              admin              NOUN   NN   noun, singular or mass                     nsubj     xxxx               True   False
logged             log                VERB   VBN  verb, past participle                      ROOT      xxxx               True   False
in                 in                 ADP    IN   conjunction, subordinating or preposition  prep      xx                 True   True
MAC                mac                PROPN  NNP  noun, proper singular                      compound  XXX                True   False
68:B5:99:A7:1D:20  68:b5:99:a7:1d:20  PROPN  NNP  noun, proper singular                      nummod    dd:Xd:dd:Xd:dX:dd  False  False
port               port               NOUN   NN   noun, singular or mass                     pobj      xxxx               True   False
20                 20                 NUM    CD   cardinal number                            nummod    dd                 False  False
VLANs              vlans              PROPN  NNP  noun, proper singular                      compound  XXXXx              True   False
EDLAB              edlab              NOUN   NN   noun, singular or mass                     dobj      XXXX               True   False
through            through            ADP    IN   conjunction, subordinating or preposition  prep      xxxx               True   True
ssh                ssh                PROPN  NNP  noun, proper singular                      nmod      xxx                True   False
184.108.40.206,    220.127.116.11,    NUM    CD   cardinal number                            nummod    ddd.ddd.ddd.ddd,   False  False
authentication     authentication     NOUN   NN   noun, singular or mass                     pobj      xxxx               True   False
Radius             radius             PROPN  NNP  noun, proper singular                      npadvmod  Xxxxx              True   False
================================================================
Network Login MAC  0    17   PRODUCT
MAC                39   42   ORG
68:B5:99:A7:1D:20  43   60   CARDINAL
20                 66   68   CARDINAL
EDLAB              75   80   ORG
18.104.22.168,     93   109  CARDINAL
Radius             125  131  GPE
================================================================
Network            compound  MAC             PROPN
Login              compound  MAC             PROPN
MAC                compound  admin           NOUN   [Network, Login]
user               compound  admin           NOUN
admin              nsubj     logged          VERB   [MAC, user]
logged             ROOT      logged          VERB   [admin, in, EDLAB, through, Radius]
in                 prep      logged          VERB   [port]
MAC                compound  port            NOUN
68:B5:99:A7:1D:20  nummod    port            NOUN
port               pobj      in              ADP    [MAC, 68:B5:99:A7:1D:20]
20                 nummod    EDLAB           NOUN
VLANs              compound  EDLAB           NOUN
EDLAB              dobj      logged          VERB   [20, VLANs]
through            prep      logged          VERB   [authentication]
ssh                nmod      authentication  NOUN
22.214.171.124,    nummod    authentication  NOUN
authentication     pobj      through         ADP    [ssh, 126.96.36.199,]
Radius             npadvmod  logged          VERB
================================================================
Subsequently, when I train my models in Prodigy, I would like Prodigy to learn network named entities and tag them as PROPNs contextually, akin to English names (this happens by default in this example, but not always). Networking logs are full of IP addresses, MAC addresses, key-value pairs and so on.
I can deal with these easily in spaCy, since I have regex support and my custom tokenization takes care of them (see the link above).
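As a rough illustration of the kind of regex handling described above, here is a minimal sketch using only Python's `re` module. The names `MAC_RE`, `IPV4_RE` and `classify` are hypothetical and are not the code from the linked issue; the patterns assume MACs appear either as six separated hex pairs or as twelve bare hex digits, as in the log line above.

```python
import re

# Hypothetical patterns for illustration only.
# MAC: six hex pairs joined by one consistent separator (":" or "-"),
# or twelve bare hex digits such as 68B599A71D20.
MAC_RE = re.compile(
    r"(?:[0-9A-Fa-f]{2}([:-])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}"
    r"|[0-9A-Fa-f]{12})"
)

# IPv4: four dot-separated octets, each in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def classify(token):
    """Return a coarse entity type for a whole token, or None."""
    if MAC_RE.fullmatch(token):
        return "MAC"
    if IPV4_RE.fullmatch(token):
        return "IPV4"
    return None


for tok in ["68:B5:99:A7:1D:20", "68B599A71D20", "192.0.2.1", "port"]:
    print(tok, classify(tok))
```

Using `fullmatch` keeps a plain number such as the port `20` from being mistaken for part of an address.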
Doing the same thing with rule-based matching is hard because the regexes for some of these entities are complex. An IP version 6 (IPv6) address is one example: the regex itself is about 20 lines of complex matches. I cannot use Prodigy with my modified tokenizer, since the tokenizer is looking for a single built-in compiled match function for the regex in create_tokenizer.
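To give a sense of what validating an IPv6 token involves without a 20-line regex, here is a minimal sketch that swaps in Python's standard `ipaddress` module instead. This is not a regex and it does not plug into patterns.jsonl; it only shows a predicate that could be used in custom preprocessing. The function name `is_ipv6` is hypothetical.

```python
import ipaddress


def is_ipv6(token: str) -> bool:
    """Return True if the whole token parses as an IPv6 address."""
    try:
        ipaddress.IPv6Address(token)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


print(is_ipv6("2001:db8::1"))
print(is_ipv6("68:B5:99:A7:1D:20"))
```

A MAC address such as 68:B5:99:A7:1D:20 is rejected because it has only six groups and no `::` shorthand.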
Is there ANY WAY I can use regexes in the patterns.jsonl file to allow Prodigy to learn network entities? Even writing a simple MAC pattern in the rule-based matcher is hard, since a MAC address mixes digits and letters and I do not see a way of expressing a logical OR in the rule.
For example, if we consider 68:B5:99:A7:1D:20, I cannot simply write a rule that says ORTH: "dd", ORTH: ":" and so on. Nor can I use shape, because I do not know where a digit or a letter will occur, so Xd or dX would need to be enumerated for all possibilities in the six positions. I cannot even begin to explain the possibilities for an IPv6 address.
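To make the size of that enumeration concrete, here is a small sketch that generates every shape a colon-separated MAC address could take. It assumes the address is a single token (as with the custom tokenization above) and that the hex letters are uppercase. The `{"label": ..., "pattern": [...]}` layout follows the general form of a patterns file, but the `shape` attribute name and the `MAC` label are assumptions for illustration.

```python
import itertools
import json

# Shapes one uppercase hex pair can take: digit/digit, digit/letter,
# letter/digit, letter/letter.
PAIR_SHAPES = ("dd", "dX", "Xd", "XX")


def mac_shape_patterns(label="MAC"):
    """Yield one pattern per possible shape of a six-pair MAC address."""
    for combo in itertools.product(PAIR_SHAPES, repeat=6):
        yield {"label": label, "pattern": [{"shape": ":".join(combo)}]}


patterns = list(mac_shape_patterns())
shapes = {p["pattern"][0]["shape"] for p in patterns}
jsonl_lines = [json.dumps(p) for p in patterns]

print(len(patterns))  # 4 ** 6 combinations
print(jsonl_lines[0])
```

That is 4096 patterns for one address format, before lowercase hex or other separators are considered.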
So, my problem is that I can teach Prodigy the IPs and MACs in the current dataset using ner.manual and pos.make-gold, but when the model is run against a different dataset, it will not recognize IPs and MACs, because they were learnt as fixed tokens rather than as a general category the way names are. I haven't found a way to generalize this learning and am looking for some way to make that happen.
I hope this is possible, but I cannot find any documentation on using regexes in patterns.jsonl. Any inputs are most welcome.