Hi @ines,
My use case is a little different from how NLP is normally used. I am trying to use it to analyze, understand, and potentially summarize log files from networking devices, to help bring down troubleshooting times. Because of this, my tokenization, NER, and POS requirements are different.
Some backstory that I wrote up when I MAY have noticed something weird in spaCy: https://github.com/explosion/spaCy/issues/2412. The link should explain a bit about what I am trying to do.
Example Log:
Network Login MAC user 68B599A71D20 logged in MAC 68:B5:99:A7:1D:20 port 20 VLANs EDLAB, authentication Radius
Custom Tokenization (plus spaCy output):
================================================================
Network network PROPN NNP noun, proper singular compound Xxxxx True False
Login login PROPN NNP noun, proper singular compound Xxxxx True False
MAC mac PROPN NNP noun, proper singular compound XXX True False
user user NOUN NN noun, singular or mass compound xxxx True False
admin admin NOUN NN noun, singular or mass nsubj xxxx True False
logged log VERB VBN verb, past participle ROOT xxxx True False
in in ADP IN conjunction, subordinating or preposition prep xx True True
MAC mac PROPN NNP noun, proper singular compound XXX True False
68:B5:99:A7:1D:20 68:b5:99:a7:1d:20 PROPN NNP noun, proper singular nummod dd:Xd:dd:Xd:dX:dd False False
port port NOUN NN noun, singular or mass pobj xxxx True False
20 20 NUM CD cardinal number nummod dd False False
VLANs vlans PROPN NNP noun, proper singular compound XXXXx True False
EDLAB edlab NOUN NN noun, singular or mass dobj XXXX True False
through through ADP IN conjunction, subordinating or preposition prep xxxx True True
ssh ssh PROPN NNP noun, proper singular nmod xxx True False
128.119.240.169, 128.119.240.169, NUM CD cardinal number nummod ddd.ddd.ddd.ddd, False False
authentication authentication NOUN NN noun, singular or mass pobj xxxx True False
Radius radius PROPN NNP noun, proper singular npadvmod Xxxxx True False
================================================================
Network Login MAC 0 17 PRODUCT
MAC 39 42 ORG
68:B5:99:A7:1D:20 43 60 CARDINAL
20 66 68 CARDINAL
EDLAB 75 80 ORG
128.119.240.169, 93 109 CARDINAL
Radius 125 131 GPE
================================================================
Network compound MAC PROPN []
Login compound MAC PROPN []
MAC compound admin NOUN [Network, Login]
user compound admin NOUN []
admin nsubj logged VERB [MAC, user]
logged ROOT logged VERB [admin, in, EDLAB, through, Radius]
in prep logged VERB [port]
MAC compound port NOUN []
68:B5:99:A7:1D:20 nummod port NOUN []
port pobj in ADP [MAC, 68:B5:99:A7:1D:20]
20 nummod EDLAB NOUN []
VLANs compound EDLAB NOUN []
EDLAB dobj logged VERB [20, VLANs]
through prep logged VERB [authentication]
ssh nmod authentication NOUN []
128.119.240.169, nummod authentication NOUN []
authentication pobj through ADP [ssh, 128.119.240.169,]
Radius npadvmod logged VERB []
================================================================
Subsequently, when I am training my models in Prodigy, I would like Prodigy to learn network named entities and tag them as PROPNs contextually, akin to English names (it is happening by default in this example, but it does not always happen). Networking logs have a plethora of IP addresses, MAC addresses, key-value pairs etc.
I can deal with these easily in spaCy, since I have regex support and my custom tokenization takes care of it (see the link above).
Doing the same thing with rule-based matching is hard because the regexes for some of these entities are complex. An IP version 6 (IPv6) address is one example: the regex alone is about 20 lines of complex matches. I also cannot use Prodigy with my modified tokenizer, since the tokenizer is looking for a single built-in compiled match function for the regex in create_tokenizer.
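As a minimal sketch of how IP detection could avoid a 20-line regex, the check below hands each token to Python's standard `ipaddress` module. The helper name `classify_ip` and the sample token list are hypothetical, and this is plain Python rather than anything spaCy or Prodigy provides. It assumes trailing commas (as in `128.119.240.169,` above) should be stripped before parsing.

```python
import ipaddress


def classify_ip(token):
    """Return "IPv4", "IPv6", or None for a single whitespace-delimited token."""
    try:
        # Strip the trailing comma that the log format leaves on some tokens.
        addr = ipaddress.ip_address(token.strip(","))
    except ValueError:
        # Not parseable as an IP address of either version.
        return None
    return "IPv6" if addr.version == 6 else "IPv4"


# Hypothetical sample tokens; the IPv6 value is an illustrative documentation address.
tokens = ["ssh", "128.119.240.169,", "2001:db8::ff00:42:8329", "68:B5:99:A7:1D:20"]
labels = [classify_ip(t) for t in tokens]
```

A MAC address such as `68:B5:99:A7:1D:20` is not mistaken for IPv6 here, because it has only six colon-separated groups and no `::` abbreviation.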
Is there ANY WAY I can use regexes in the patterns.jsonl file to allow Prodigy to learn network entities? Even writing a simple MAC pattern in the rule-based matcher is hard, since a MAC address mixes digits and letters and I do not see a way of expressing a logical OR in the rule.
For example, if we consider 68:B5:99:A7:1D:20, I cannot simply write a rule that says ORTH: "dd", ORTH: ":" and so on. Nor can I use shape, because I do not know where a digit or a letter will occur, so Xd or dX would need to be enumerated for all possibilities in the six positions. I cannot even begin to explain the possibilities for an IPv6 address.
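For comparison, this is the kind of "digit OR letter" rule that a regex expresses in one line. It is a sketch in plain Python `re`, not a patterns.jsonl feature, and the names `MAC_RE` and `is_mac` are hypothetical. A character class such as `[0-9A-Fa-f]` covers every digit/letter combination in a pair, so nothing has to be enumerated per position.

```python
import re

# Six colon-separated pairs; each character may be a digit or a hex letter.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}$")


def is_mac(token):
    """Return True if the token is a colon-separated MAC address."""
    return MAC_RE.match(token) is not None


# With uppercase letters only, a pair has four possible shapes (dd, dX, Xd, XX),
# so enumerating shapes for six pairs would take this many patterns.
shape_count = 4 ** 6
```

With this, `is_mac("68:B5:99:A7:1D:20")` is true, while a token with only five pairs or with a non-hex letter is rejected.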
So, my problem is that I can teach Prodigy the IPs and MACs in the current dataset using ner.manual and pos.make-gold, but when the model is run against a different dataset, it will not recognize IPs and MACs, because it learnt them as fixed tokens rather than as a general class the way names are learnt. I haven't found a way to generalize this learning and am looking for some way to make that happen.
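One possible workaround, sketched below under assumptions, is to pre-annotate the raw logs with regexes outside Prodigy and write the matches as character-offset spans, one JSON object per line. The `text`/`spans` layout with `start`, `end` and `label` keys is my assumption about what Prodigy accepts for pre-annotated input; the labels, the `ENTITY_PATTERNS` table and the `make_task` helper are hypothetical, and the simple IPv4 regex does not validate octet ranges.

```python
import json
import re

# Hypothetical label -> regex table; extend with further entity types as needed.
ENTITY_PATTERNS = [
    ("MAC", re.compile(r"\b[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def make_task(text):
    """Build one task dict with a character-offset span for every regex match."""
    spans = []
    for label, pattern in ENTITY_PATTERNS:
        for match in pattern.finditer(text):
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    # Keep spans in reading order.
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


log = ("Network Login MAC user 68B599A71D20 logged in MAC 68:B5:99:A7:1D:20 "
       "port 20 VLANs EDLAB, authentication Radius")
task = make_task(log)
print(json.dumps(task))  # one line of a JSONL file
```

Because the spans come from the regex rather than from a fixed token list, every MAC or IP in a new dataset would be pre-highlighted the same way, which gives the model many different surface forms of each label to learn from.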
I hope this is possible, but I cannot find any documentation on using regexes in patterns.jsonl. Any input is most welcome.