Off-track use of Prodigy/Spacy - Custom Regex Pattern Matching and Modeling

Hi @ines,

My use-case is slightly off the normal way NLP is used. I am trying to use it to analyze, understand and potentially summarize log files from networking devices, so that it can help bring down troubleshooting times. Because of this, my tokenization, NER and POS requirements are different.

Some backstory that I wrote up when I MAY have noticed something weird in Spacy: https://github.com/explosion/spaCy/issues/2412. The link should explain a bit about what I am trying to do.

Example Log:
Network Login MAC user 68B599A71D20 logged in MAC 68:B5:99:A7:1D:20 port 20 VLANs EDLAB, authentication Radius

Custom Tokenization (plus Spacy output):

================================================================
Network network PROPN NNP noun, proper singular compound Xxxxx True False
Login login PROPN NNP noun, proper singular compound Xxxxx True False
MAC mac PROPN NNP noun, proper singular compound XXX True False
user user NOUN NN noun, singular or mass compound xxxx True False
admin admin NOUN NN noun, singular or mass nsubj xxxx True False
logged log VERB VBN verb, past participle ROOT xxxx True False
in in ADP IN conjunction, subordinating or preposition prep xx True True
MAC mac PROPN NNP noun, proper singular compound XXX True False
68:B5:99:A7:1D:20 68:b5:99:a7:1d:20 PROPN NNP noun, proper singular nummod dd:Xd:dd:Xd:dX:dd False False
port port NOUN NN noun, singular or mass pobj xxxx True False
20 20 NUM CD cardinal number nummod dd False False
VLANs vlans PROPN NNP noun, proper singular compound XXXXx True False
EDLAB edlab NOUN NN noun, singular or mass dobj XXXX True False
through through ADP IN conjunction, subordinating or preposition prep xxxx True True
ssh ssh PROPN NNP noun, proper singular nmod xxx True False
128.119.240.169, 128.119.240.169, NUM CD cardinal number nummod ddd.ddd.ddd.ddd, False False
authentication authentication NOUN NN noun, singular or mass pobj xxxx True False
Radius radius PROPN NNP noun, proper singular npadvmod Xxxxx True False
================================================================
Network Login MAC 0 17 PRODUCT
MAC 39 42 ORG
68:B5:99:A7:1D:20 43 60 CARDINAL
20 66 68 CARDINAL
EDLAB 75 80 ORG
128.119.240.169, 93 109 CARDINAL
Radius 125 131 GPE
================================================================
Network compound MAC PROPN []
Login compound MAC PROPN []
MAC compound admin NOUN [Network, Login]
user compound admin NOUN []
admin nsubj logged VERB [MAC, user]
logged ROOT logged VERB [admin, in, EDLAB, through, Radius]
in prep logged VERB [port]
MAC compound port NOUN []
68:B5:99:A7:1D:20 nummod port NOUN []
port pobj in ADP [MAC, 68:B5:99:A7:1D:20]
20 nummod EDLAB NOUN []
VLANs compound EDLAB NOUN []
EDLAB dobj logged VERB [20, VLANs]
through prep logged VERB [authentication]
ssh nmod authentication NOUN []
128.119.240.169, nummod authentication NOUN []
authentication pobj through ADP [ssh, 128.119.240.169,]
Radius npadvmod logged VERB []
================================================================

Subsequently, when I am training my models in Prodigy, I would like Prodigy to learn network named entities and tag them as PROPNs contextually, akin to English names (it is happening by default in this example, but it does not always happen). Networking logs have a plethora of IP addresses, MAC addresses, key-value pairs etc.

I can deal with these easily in Spacy since I have regex support and my custom tokenization takes care of it (refer link above).

Doing the same thing with rule-based matching is hard due to the complex nature of regexes for some of these entities, for example an IP version 6 (IPv6) address. The regex itself is about 20 lines of complex matches. I cannot use Prodigy with my modified tokenizer (since the tokenizer is looking for a single builtin compiled match function for the regex in create_tokenizer).

Is there ANY WAY I can use regexes in the patterns.jsonl file to allow Prodigy to learn network entities? Even writing a simple MAC pattern in the rule-based matcher is kind of hard, since it involves both digits and letters and I do not see a way of expressing a logical OR in the rule.

For example, if we consider 68:B5:99:A7:1D:20, I cannot simply write a rule that says ORTH: "dd", ORTH: ":" and so on, nor can I use shape because I do not know where a digit or a letter will occur, so Xd or dX needs to be enumerated for all possibilities in the six positions. I cannot even begin to explain the possibilities for an IPv6 address.
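
To put a rough number on that enumeration, here is a small sketch. It assumes uppercase hex only, so each two-character group has one of four shapes (dd, dX, Xd, XX); the variable names are purely illustrative.

```python
from itertools import product

# The four shapes a two-character uppercase hex group can take
# (d = digit, X = uppercase letter).
GROUP_SHAPES = ["dd", "dX", "Xd", "XX"]

# Every colon-separated combination of six groups.
mac_shapes = [":".join(groups) for groups in product(GROUP_SHAPES, repeat=6)]

print(len(mac_shapes))                      # 4 ** 6 combinations
print("dd:Xd:dd:Xd:dX:dd" in mac_shapes)    # the shape of 68:B5:99:A7:1D:20
```

Allowing lowercase hex digits or "-" as a separator would multiply the count further, which is why a shape-by-shape rule list does not scale here.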

So, my problem is that I can teach Prodigy IPs and MACs in the current dataset using ner.manual and pos.make-gold, but when run against a different dataset, it will not recognize IPs and MACs due to them being learnt as actual fixed tokens and not names etc. I haven't found a way to generalize this learning and am looking for some way to make that happen.

I hope this is true, but I do not find any documentation to help me with using regexes in patterns.jsonl. Any inputs are most welcome.

Sorry about the confusing docs – I have no idea how the regular expressions ended up there, because no, this doesn’t yet work out of the box.

As you said, your application is a little non-standard, so it might take some experimentation to get this right. But it sounds like you’ve put a lot of thought into this already. At the moment, regular expressions aren’t natively supported in spaCy’s Matcher – but we’re working on a full overhaul of the API, which will allow more flexible matching, including regex patterns, custom attributes and set membership. See this thread for more details. In the meantime, here’s an example of how to use regular expressions in the current version.

As a first step, I’d suggest experimenting with your own regex-based matcher. Prodigy’s pattern matcher isn’t magic, and you can easily implement something similar yourself. Your matcher needs to be callable on a stream of incoming examples and yield annotation tasks with highlighted matches. If you look at the source of the recipe, you can see that Prodigy simply combines the NER model and the pattern matcher using the combine_models helper. This gets the results from both models and interleaves them. When the annotations come back, both models can be updated – so annotations based on your matches will feed into the entity recognizer. Here’s an example of how a matcher like this could look:

import re
import copy

class RegexMatcher(object):
    def __init__(self, expression, label):
        self.expression = re.compile(expression)
        self.label = label

    def __call__(self, batch):
        for eg in batch:
            for match in re.finditer(self.expression, eg['text']):  # find match in example text
                task = copy.deepcopy(eg)  # match found – copy the example
                start, end = match.span()  # get matched indices
                task['spans'] = [{'start': start, 'end': end, 'label': self.label}]  # label match
                yield 0.5, task  # (score, example) tuples

    def update(self, examples):
        # this is normally used for updating the model, but we're just
        # going to do nothing here and return 0, which will be added to
        # the loss returned by the model's update() method
        return 0

Edit: My previous code contained an error, because I forgot to yield (score, example) tuples. I’ve set the score to always be 0.5, but if you’re using a different sorter, you might want to adjust this.

In your recipe, you could then do the following:

model = EntityRecognizer(nlp, label=label)
matcher = RegexMatcher(r'...', 'MAC_ADDRESS')
predict, update = combine_models(model, matcher)
stream = prefer_uncertain(predict(stream))

Of course, you probably want to set your matcher up so it can take multiple patterns and labels – but I tried to keep the example code as straightforward as possible. The matcher can also implement other logic that determines whether to suggest an example for annotation, and how to present it.
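
As a rough sketch of the multi-pattern variant mentioned above: the class name MultiRegexMatcher and both regular expressions are assumptions made for illustration (the IPv4 pattern is deliberately loose), and the example only exercises the matcher on plain dicts, without Prodigy in the loop.

```python
import copy
import re


class MultiRegexMatcher(object):
    """Hypothetical matcher that takes several (pattern, label) pairs."""

    def __init__(self, patterns):
        # Compile each expression once; patterns is a list of (regex, label) tuples.
        self.patterns = [(re.compile(expr), label) for expr, label in patterns]

    def __call__(self, stream):
        for eg in stream:
            for expression, label in self.patterns:
                for match in expression.finditer(eg['text']):
                    task = copy.deepcopy(eg)  # one task per match
                    start, end = match.span()
                    task['spans'] = [{'start': start, 'end': end, 'label': label}]
                    yield 0.5, task  # (score, example) tuples

    def update(self, examples):
        # Nothing to learn for a rule-based matcher, so contribute 0 to the loss.
        return 0


matcher = MultiRegexMatcher([
    (r'(?:[0-9a-fA-F]{2}[-:]){5}[0-9a-fA-F]{2}', 'MAC_ADDRESS'),
    (r'\b(?:\d{1,3}\.){3}\d{1,3}\b', 'IP_ADDRESS'),  # loose: does not check 0-255
])
text = 'logged in MAC 68:B5:99:A7:1D:20 through ssh 128.119.240.169'
tasks = [task for score, task in matcher([{'text': text}])]
for task in tasks:
    span = task['spans'][0]
    print(span['label'], text[span['start']:span['end']])
```

Each match becomes its own task, so one log line can produce several questions, one per suggested entity.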

How you proceed from here on really depends on how your experiments go. It’d be especially interesting to see how the model in the loop responds to the pattern annotations and how long it takes to get reasonable predictions (or even the first correctly predicted entity).

@ines,
Thank you for the pointers. This is what I came up with, in accordance with the documentation and ner recipes.

import re
import copy
import spacy
from prodigy.util import log
from prodigy.util import combine_models
from spacy.pipeline import EntityRecognizer
from prodigy.core import recipe, recipe_args
from prodigy.components.loaders import get_stream
from prodigy.components.sorters import prefer_uncertain
from prodigy.components.preprocess import split_sentences

MAC_PATTERN = r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})".strip()


class RegexMatcher(object):

    def __init__(self, expression, label):
        self.expression = re.compile(expression)
        self.label = label

    def __call__(self, batch):
        for eg in batch:
            for match in re.finditer(self.expression, eg['text']):  # find match in example text
                task = copy.deepcopy(eg)  # match found – copy the example
                start, end = match.span()  # get matched indices
                print(eg, start, end)
                task['spans'] = [{'start': start, 'end': end, 'label': self.label}]  # label match
                yield 0.5, task  # (score, example) tuples

    def update(self, examples):
        # this is normally used for updating the model, but we're just
        # going to do nothing here and return 0, which will be added to
        # the loss returned by the model's update() method
        return 0


@recipe('pattern.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        label=recipe_args['label_set'],
        patterns=recipe_args['patterns'],
        exclude=recipe_args['exclude'],
        unsegmented=recipe_args['unsegmented'])
def pattern_teach(dataset, spacy_model, source=None, api=None, loader=None,
          label=None, patterns=None, exclude=None, unsegmented=False):
    """
    Collect the best possible training data for a named entity recognition
    model with the model in the loop. Based on your annotations, Prodigy will
    decide which questions to ask next.
    """
    log("RECIPE: Starting recipe pattern.teach", locals())
    # Initialize the stream, and ensure that hashes are correct, and examples
    # are deduplicated.
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    # Create the model, using a pre-trained spaCy model.
    nlp = spacy.load(spacy_model)
    log("RECIPE: Creating EntityRecognizer using model {}".format(spacy_model))
    model = EntityRecognizer(nlp.vocab, label=label)
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = RegexMatcher(MAC_PATTERN, 'MAC_ADDRESS')
        # matcher = PatternMatcher(model.nlp).from_disk(patterns)
        log("RECIPE: Created RegexMatcher and loaded in patterns", patterns)
        # Combine the NER model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
        stream = prefer_uncertain(predict(stream))
        # Split the stream into sentences
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    # Return components, to construct Controller
    return {
        'view_id': 'ner',
        'dataset': dataset,
        'stream': stream,
        'update': update,  # callback to update the model in-place
        'exclude': exclude
    }

For some reason, this starts the web-server, but there are no labels to select. Here’s a screenshot of what the web-server looks like.

Not quite sure if I am doing something wrong when passing the stream back to the recipe for processing. Thanks for taking a look.

Also, for some reason, I was not able to pass ‘nlp’ (the loaded spacy model) to EntityRecognizer. It kept complaining about the Vocab type being wrong (so had to pass nlp.vocab explicitly). It also complained about these sections in ner.teach, so had to replace them with alternatives.

model = EntityRecognizer(nlp, label=label) --> model = EntityRecognizer(nlp.vocab, label=label)

and

stream = split_sentences(model.orig_nlp, stream) --> stream = split_sentences(nlp, stream). I am not quite sure of how much of a difference it makes, since I was not even able to get my recipe to run without these changes.

Oh, so you actually want to use the manual NER interface and also correct the matches? Sorry, I didn't realise that. It's only a small change, though: Instead of 'view_id': 'ner', make the recipe return 'view_id': 'ner_manual'. And as the last component returned by the recipe, add the 'labels' to the 'config':

    # ...
    'exclude': exclude,
    'config': {
        'labels': ['MAC_ADDRESS', 'SOMETHING_ELSE']
    }

Ah, I think I know what your problem is: You're using spaCy's entity recognizer component, not Prodigy's EntityRecognizer model. Sorry if the naming was confusing here. But you want to use the model that creates annotation tasks and can be updated from annotations, not the generic entity recognizer that makes predictions in a spaCy pipeline. So just remove the following line:

from spacy.pipeline import EntityRecognizer

And replace it with:

from prodigy.models.ner import EntityRecognizer

This doesn't really matter. The model.orig_nlp is the original nlp object that the model is initialised with. In my example, I used nlp because it's more explicit and requires less explanation.

@ines,
Thank you for the quick response. That kind of fixed most of the issues. It was silly of me. I should have double checked for Prodigy’s EntityRecognizer (silly mistake).

So, when I make your suggested changes and run it, the web-server front-end says something went wrong with the web-server and it may be a bug ALTHOUGH verbose logging on the terminal suggests that it learns things, since it is printing out MAC addresses and their spans. That leads me to believe that the code per-se is working.

The funny thing is that I experimented by changing the view_id back to just ner and it was working fine, identifying MACs correctly, see below:

until it had to save the annotations, at which point it crapped out complaining about not finding a transition. Error is below.

12:50:34 - CONTROLLER: Receiving 21 answers
12:50:34 - MODEL: Merging entity spans of 21 examples
12:50:34 - MODEL: Using 21 examples (without 'ignore')
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'L-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MAC_ADDRESS', 'I-MAC_ADDRESS', 'L-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'I-MAC_ADDRESS', 'L-MAC_ADDRESS', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
12:50:34 - Exception when serving /give_answers
Traceback (most recent call last):
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/waitress/channel.py", line 338, in service
    task.service()
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/waitress/task.py", line 169, in service
    self.execute()
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/waitress/task.py", line 399, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "hug/api.py", line 424, in hug.api.ModuleSingleton.__call__.api_auto_instantiate
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "hug/interface.py", line 734, in hug.interface.HTTP.__call__
  File "hug/interface.py", line 709, in hug.interface.HTTP.__call__
  File "hug/interface.py", line 649, in hug.interface.HTTP.call_function
  File "hug/interface.py", line 100, in hug.interface.Interfaces.__call__
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/prodigy/app.py", line 102, in give_answers
    controller.receive_answers(answers)
  File "cython_src/prodigy/core.pyx", line 113, in prodigy.core.Controller.receive_answers
  File "cython_src/prodigy/util.pyx", line 270, in prodigy.util.combine_models.update
  File "cython_src/prodigy/models/ner.pyx", line 318, in prodigy.models.ner.EntityRecognizer.update
  File "cython_src/prodigy/models/ner.pyx", line 391, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 385, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 386, in prodigy.models.ner.EntityRecognizer._update
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/ARMn/lib/python3.6/site-packages/spacy/language.py", line 427, in update
    proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
  File "nn_parser.pyx", line 564, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 681, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 118, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 177, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'U-MAC_ADDRESS' in the NER model."

Please accept my apologies. This is not my domain. I am working on this area more on a hypothesis that this will be useful to analyze large amounts of data, so I am picking up the fundamentals of NLP as I go along. The back-and-forth will reduce in a bit as I get a hang of things. :slight_smile:

Changed all my references to a label that exists in the model, such as “PRODUCT” and it works fine. Leads me to believe that it is more a problem about adding a new label to the “en_core_web_sm” model than anything else. It was my understanding that the RegexMatcher approach would allow me to add new labels, but this seems to suggest otherwise. Please correct me if my understanding is wrong.

No worries! The problem you're trying to solve isn't trivial, and it's a very experimental process. Prodigy also introduces a lot of new concepts, so I totally understand if some things are a little confusing at first. (Also, sorry if some things aren't perfectly clear in the docs – we're still working on more tutorials and videos for more use cases.)

Yes, that's likely the correct analysis :+1: You can add the label to your model by getting the NER pipeline component and then calling its add_label method:

labels = ['MAC_ADDRESS']  # etc.
ner = nlp.get_pipe('ner')  # get the entity recognizer in the pipeline
for label in labels:
    ner.add_label(label)

Actually, if you're having problems with ner_manual, this was likely my fault. The manual NER interface needs the text to be tokenized, so you'll need to add the stream wrapper that takes care of that (just like in the ner.manual recipe):

from prodigy.components.preprocess import add_tokens

And then wrap it around your stream:

if not unsegmented:
    stream = split_sentences(nlp, stream)
stream = add_tokens(nlp, stream)

Also, could you check if you have the latest version of Prodigy installed (v1.5.1)? We've recently added some more debugging features, like stream validation and better error messages. So if you're still on an older version, the feedback you're getting if something goes wrong might not be as helpful yet.

Yes, I downloaded the latest version as soon as I got the email. That is what allowed me to fix the initial issues that I was seeing. And I went back and looked at ner_manual. I just added the tokenization as you replied. :smiley:

Let me troubleshoot with this and I will keep you posted. Hypothesizing about this approach makes me (and another student who actually works on NLP) think it may be a workable idea, but to what extent depends on how much I can get out of spaCy and Prodigy. I would like to thank you and @honnibal for the excellent tools, I could not dream of doing this with NLTK (as much as I did use NLTK for my initial feasibility analysis). I am glad I chanced upon spaCy and although it did take me some more time than I would have liked to get this running, things are looking up at the moment. Many thanks to you.

@ines,
Forgot to ask you a very naive question. When I teach my model how to recognize custom labels and regex patterns, how do I batch-train something at the end of the process? Do I need to write something along the lines of pos.batch-train or ner.batch-train to get this done?

Thanks so much, this really means a lot! The project sounds super interesting btw and I really hope you succeed – it'd make a great example of a very different and unique application of the technology.

The data you get out at the end of it are span annotations in Prodigy's format. This is the exact same format the other NER recipes produce. So to train your model, you can run the regular ner.batch-train recipe. The regex patterns and your custom pattern matcher mainly help you pre-select the examples for annotation – but the model you're training on the data later on doesn't care about any of this and just gets to see and learn from the final annotations.

Hi @ines,

Thank you so much for the encouragement. Will try my best to make it work. :blush:

Here are some initial thoughts after experimenting with the modeling. This works MARGINALLY better than the previous approach, but is still limited. With regex pattern-based learning, Prodigy at least learns the semantic context of where MACs and IPs occur. At least it does recognize MACs and IPs that it has not seen before. The limiting factor is that they still have to be from the same dataset and my understanding is that it is learning based on sentence structure. This technically means that my representative dataset will have to have like a zillion variants of how and where a MAC or an IP may occur (which again is infeasible).

I am trying to understand how NLP does this for English names and generalizes them across corpora. The problem, at first glance, seems similar. There are so many names that teaching it all of them is infeasible, yet it happens.

It appears that there is some NLP part of this that I may be missing. Also, if someone does not have Prodigy, I am assuming I can do the same training using spaCy, as long as I compile the training set in a similar manner with the NERs or POSs and spans and use the following code snippet to train?

        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
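
For reference, a minimal sketch of how span annotations (dicts with 'text', 'spans' and 'answer' keys) could be turned into the (text, annotations) tuples that a loop like this iterates over. The helper name to_train_data and the sample records are assumptions for illustration only.

```python
def to_train_data(annotations):
    """Convert accepted span annotations into (text, {'entities': [...]}) tuples."""
    train_data = []
    for eg in annotations:
        if eg.get('answer') != 'accept':
            continue  # skip rejected and ignored examples
        entities = [(s['start'], s['end'], s['label']) for s in eg.get('spans', [])]
        train_data.append((eg['text'], {'entities': entities}))
    return train_data


annotations = [
    {'text': 'MAC 68:B5:99:A7:1D:20 port 20',
     'spans': [{'start': 4, 'end': 21, 'label': 'MAC_ADDRESS'}],
     'answer': 'accept'},
    {'text': 'port 20 VLANs EDLAB',
     'spans': [{'start': 5, 'end': 7, 'label': 'MAC_ADDRESS'}],
     'answer': 'reject'},
]
TRAIN_DATA = to_train_data(annotations)
print(TRAIN_DATA)
```

Note that an accepted yes/no annotation only confirms the one highlighted span, so data built this way may be incomplete for texts that contain further entities.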

Yeah, well, this is kind of inherent to named entity recognition – and usually, the sentence structure and window around the entity is exactly what you want the predictions to be sensitive to. That's also what makes it work so well for "regular" named entities. If you haven't seen it already, you might find this video helpful. @honnibal explains the neural network model architecture in spaCy v2.0, with a focus on the NER model:

The features used in the model are the token attributes NORM, PREFIX, SUFFIX and SHAPE. It's possible that your NER model actually needs different features to better reflect the text types you're working with (which are pretty different from natural language).
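
To make the SHAPE feature more concrete, here is a small standalone approximation. It is not spaCy's own code: the function below is a sketch that assumes the shape maps uppercase letters to X, lowercase letters to x, digits to d, keeps other characters as they are, and truncates runs of the same shape character after four.

```python
def word_shape(text):
    """Approximate a token shape such as 'Xxxxx' or 'dd:Xd:dd'."""
    shape = []
    last = ''
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            shape_char = 'd'
        else:
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four of the same shape character in a row
            shape.append(shape_char)
    return ''.join(shape)


for token in ['Network', '68:B5:99:A7:1D:20', '00:1A:2B:3C:4D:5E']:
    print(token, word_shape(token))
```

Two MAC addresses usually end up with different shapes, so this feature alone gives the model little to generalise from, which fits the behaviour described earlier in the thread.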

Also, it might not be the most satisfying answer, but have you considered using a purely rule-based approach for the MAC and IP addresses? It seems like you're able to narrow it down pretty well using regular expressions? You could implement this via a custom pipeline component that sets the doc.ents manually.

You can then use Prodigy to evaluate your rules, adjust them if they produce false positives and add more cases for the ones it missed. The evaluation process would be pretty straightforward: you stream in the predictions and accept/reject whether all entities are correctly labelled. And when you exit the server, you can calculate and output an accuracy number. Even if it's not possible to capture all cases, you can at least narrow in on the approach that produces the highest accuracy.
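
The accuracy calculation could look roughly like this. It is a sketch that assumes each saved annotation is a dict with an 'answer' key set to 'accept', 'reject' or 'ignore'; the function name and the sample answers are made up for illustration.

```python
def rule_accuracy(annotations):
    """Share of accepted answers among all accepted and rejected ones."""
    accepted = sum(1 for eg in annotations if eg.get('answer') == 'accept')
    rejected = sum(1 for eg in annotations if eg.get('answer') == 'reject')
    total = accepted + rejected
    return accepted / total if total else 0.0  # ignored examples do not count


answers = [{'answer': 'accept'}, {'answer': 'accept'}, {'answer': 'accept'},
           {'answer': 'reject'}, {'answer': 'ignore'}]
print(rule_accuracy(answers))
```

Running the same function on annotations collected for each version of the rules gives a simple number to compare them by.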

Yes, exactly. In fact, once you've collected more annotations and want to run large-scale batch training jobs, you might find the spacy train command better suited. Prodigy's training commands are really optimised for running lots of quick experiments, even with very small datasets.

@ines,
I have been taking a long, in-depth look at your suggestions that you listed here. After going over @honnibal’s video over and over, and analyzing my dataset, you are right. Seems like a rule-based approach that you suggested in the latter part of the reply makes sense, at least for now. I think the existing NER model can handle my other entities decently enough. So re-inventing the wheel for 2 entities does not make sense.

If it is not too much to ask, could you please elaborate a bit more on the rule-based suggestion of implementing a custom pipeline component and then having Prodigy evaluate it? I get the part where you stream in examples to Prodigy and it allows me to accept/reject its suggestion, but the part before that is a little unclear (not conceptually, but implementation-wise).

Sure, here are the relevant docs pages to get you started:

For example, you could pretty much reuse most of your RegexMatcher and write a function that takes a doc object, finds the IP or MAC addresses in the doc.text and returns the index of the start token and the index of the end token in the doc. This will let you create your own Span objects, which you can add to the doc.ents. Here's an example:

import spacy
from spacy.tokens import Span

def custom_matcher_component(doc):
    # This function will be run automatically when you call nlp
    # on a string of text. It receives the doc object and lets
    # you write to it – e.g. to the doc.ents or a custom attribute
    regex_matches = YourCustomRegexMatcher(doc)
    for start_token_index, end_token_index in regex_matches:
        # Create a new Span object from the doc, the start token 
        # and the end token index
        span = Span(doc, start_token_index, end_token_index, label=doc.vocab.strings['IP_ADDRESS'])
        # Overwrite the doc.ents and add your new entity span
        doc.ents = list(doc.ents) + [span]
    return doc

nlp = spacy.load('en_core_web_sm')  # or any model you want to use
nlp.add_pipe(custom_matcher_component, after='ner')

Adding the function to the pipeline means that it will be run automatically when you call nlp on a text. One important thing to note: Here, we're adding it after the ner component (the statistical named entity recognizer). Since a token can only be part of one entity, you'd need to make sure that the doc.ents don't contain any overlapping spans when you add your new spans to them – otherwise, spaCy will raise an error.
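
One way to handle the overlap, sketched on plain (start_token, end_token, label) tuples rather than real Span objects: drop any model-predicted span that shares a token with a rule-based span, then combine the rest. The function name merge_spans is hypothetical, and the sketch assumes the rule-based spans do not overlap each other.

```python
def merge_spans(model_spans, rule_spans):
    """Keep all rule spans and only those model spans that do not overlap them."""
    def overlaps(a, b):
        # End indices are exclusive, as with token slices.
        return a[0] < b[1] and b[0] < a[1]

    kept = [m for m in model_spans
            if not any(overlaps(m, r) for r in rule_spans)]
    return sorted(kept + list(rule_spans))


model_spans = [(0, 3, 'PRODUCT'), (8, 9, 'CARDINAL')]
rule_spans = [(8, 9, 'MAC_ADDRESS')]
print(merge_spans(model_spans, rule_spans))
```

Giving the rules priority is a design choice; if the statistical model should win instead, swap the roles of the two lists.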

To evaluate your rules in Prodigy, you could then run your nlp object over a bunch of text and extract all IP_ADDRESS entities from the doc.ents and save the result in Prodigy's JSONL format. For example:

examples_to_evaluate = []
for doc in nlp.pipe(LIST_OF_TEXTS):  # use nlp.pipe here for better performance
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_} for ent in doc.ents
              if ent.label_ in ('IP_ADDRESS', 'MAC_ADDRESS')]
    for span in spans:
        # Here, we want to create one example per span, so you
        # can evaluate each entity separately
        example = {'text': doc.text, 'spans': [span]}
        examples_to_evaluate.append(example)
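
Writing the collected examples out as JSONL (one JSON object per line) could then look like the sketch below. The helper name write_jsonl, the file name and the sample record are assumptions; a temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf8') as f:
        for eg in examples:
            f.write(json.dumps(eg) + '\n')


sample_examples = [
    {'text': 'through ssh 128.119.240.169',
     'spans': [{'start': 12, 'end': 27, 'label': 'IP_ADDRESS'}]},
]
path = os.path.join(tempfile.mkdtemp(), 'rule_matches.jsonl')
write_jsonl(path, sample_examples)

with open(path, encoding='utf8') as f:
    lines = f.read().splitlines()
print(len(lines), 'example(s) written to', path)
```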

You can then stream in the data and accept/reject whether the entity produced by your rules is correct. Based on those annotations, you can calculate the percentage of correctly matched spans. As you change your rules and regular expressions, you can re-run the same evaluation with the same data, and compare the results.

@ines,
Thank you for that detailed explanation. I am actually trying that out now and was rewriting my recipes and code to account for what you detailed here.

I am noticing something and could really use your input on this.

  1. The regex matching seems to be working okay, but often I see only part of the pattern being recognized. If you look closely at the display here, it appears as if there is a space in the pattern (and initially led me to believe that the tokenization was incorrect). I copy-pasted the text into a text editor to see if there were indeed any spaces or special characters, but nothing is showing up.

Many training examples end up like this.

  2. Prodigy takes time to load examples every now and then. Maybe a minute or so. Is this a function of how large the input file is? Anything I can do about it, other than cutting down the training examples in the file?

Any ideas are welcome and thanks in advance. Just in case you need it, I am posting my recipe below.

import re
import copy
import spacy
from prodigy.util import log
from prodigy.util import combine_models
from prodigy.core import recipe, recipe_args
from prodigy.models.ner import EntityRecognizer
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.components.sorters import prefer_uncertain
from prodigy.components.preprocess import split_sentences

MAC_PATTERN = r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})".strip()
IPV4_PATTERN = (
    r"(([0-9](?!\d)|[1-9][0-9](?!\d)|1[0-9]{2}(?!\d)|2[0-4][0-9](?!\d)|25[0-5](?!\d))[.]?){4}").strip()

REGEX_PATTERNS = [
    (re.compile(MAC_PATTERN), "MACADDRESS"),
    (re.compile(IPV4_PATTERN), "IPADDRESS")
    ]


class RegexMatcher(object):

    def __init__(self, expression, label):
        self.expression = re.compile(expression)
        self.label = label

    def __call__(self, batch):
        for eg in batch:
            for regex in REGEX_PATTERNS:
                for match in re.finditer(regex[0], eg['text']):  # find match in example text
                    task = copy.deepcopy(eg)  # match found – copy the example
                    start, end = match.span()  # get matched indices
                    print(eg, start, end)
                    task['spans'] = [{'start': start, 'end': end, 'label': regex[1]}]  # label match
                    yield 0.5, task  # (score, example) tuples

    def update(self, examples):
        # this is normally used for updating the model, but we're just
        # going to do nothing here and return 0, which will be added to
        # the loss returned by the model's update() method
        return 0


@recipe('net.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        label=recipe_args['label_set'],
        patterns=recipe_args['patterns'],
        exclude=recipe_args['exclude'],
        unsegmented=recipe_args['unsegmented'])
def teach(dataset, spacy_model, source=None, api=None, loader=None,
          label=None, patterns=None, exclude=None, unsegmented=False):
    """
    Collect the best possible training data for a named entity recognition
    model with the model in the loop. Based on your annotations, Prodigy will
    decide which questions to ask next.
    """
    log("RECIPE: Starting recipe net.teach", locals())
    # Initialize the stream, and ensure that hashes are correct, and examples
    # are deduplicated.
    stream = get_stream(source, api=api, loader=loader, rehash=True,
                        dedup=True, input_key='text')
    # Create the model, using a pre-trained spaCy model.
    nlp = spacy.load(spacy_model)
    log("RECIPE: Creating EntityRecognizer using model {}".format(spacy_model))
    ner = nlp.get_pipe('ner')  # get the entity recognizer in the pipeline
    for pattern in REGEX_PATTERNS:
        ner.add_label(pattern[1])
    model = EntityRecognizer(nlp, label=label)
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = RegexMatcher()
        # matcher = PatternMatcher(model.nlp).from_disk(patterns)
        log("RECIPE: Created RegexMatcher and loaded in patterns", patterns)
        # Combine the NER model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Split the stream into sentences; scoring and sorting happen once,
    # in the 'stream' entry of the returned components below
    if not unsegmented:
        stream = split_sentences(nlp, stream)
    stream = add_tokens(nlp, stream)
    # Return components, to construct Controller
    return {
        'view_id': 'ner',
        'dataset': dataset,
        # 'stream': stream,
        'stream': prefer_uncertain(predict(stream), algorithm='probability'),
        'update': update,  # callback to update the model in-place
        'exclude': exclude,
        'config': {
            'labels': (', '.join(label)) if label is not None else 'all'}
    }

Command:

prodigy net.teach mac_learning_dataset /tmp/model /Users/Abhishek/Command-Line-Tools/brat-v1.3/data/nlp-log-analysis/training-dataset.txt -F ~/Projects/Python-Projects/Projects/NM-NLP/net.py --label MACADDRESS

@ines,
Some more information. I just dumped the tokenization coming from the stream into the recipe and looked for one of the examples where this was happening, and it looks like improper tokenization. Here is the token dump for reference:

{'text': 'Dropping the radius packet for Station ac:37:43:4a:d9:78 f0:5c:19:21:ef:90 doing 802.1x', '_input_hash': 1033013760, '_task_hash': 1926972101, 'tokens': [{'text': 'Dropping', 'start': 0, 'end': 8, 'id': 0}, {'text': 'the', 'start': 9, 'end': 12, 'id': 1}, {'text': 'radius', 'start': 13, 'end': 19, 'id': 2}, {'text': 'packet', 'start': 20, 'end': 26, 'id': 3}, {'text': 'for', 'start': 27, 'end': 30, 'id': 4}, {'text': 'Station', 'start': 31, 'end': 38, 'id': 5}, {'text': 'ac:37:43:4a', 'start': 39, 'end': 50, 'id': 6}, {'text': ':', 'start': 50, 'end': 51, 'id': 7}, {'text': 'd9:78', 'start': 51, 'end': 56, 'id': 8}, {'text': 'f0:5c:19:21:ef:90', 'start': 57, 'end': 74, 'id': 9}, {'text': 'doing', 'start': 75, 'end': 80, 'id': 10}, {'text': '802.1x', 'start': 81, 'end': 87, 'id': 11}], 'spans': []}

I actually did reference the same problem in my other post, where I asked Matthew about tokenization and provided some more examples. I am not quite sure what the reason for this inconsistency is, or how to resolve it. :slightly_frowning_face:

If you're going for the rule-based approach, I would leave out the model and active learning workflow and really just statically annotate your matches. I think part of what you're seeing here are the model's suggestions, which can be kinda random – for example, the example in your screenshot is just something the model guessed (it has a score assigned in the corner and not a pattern or other meta).

So I would just write a simple recipe that streams in your data and yields out the matches – nothing more.
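As a rough sketch of what "yield out the matches" could look like, here is a plain generator that turns regex matches into annotation tasks with character-offset spans. The names `regex_tasks` and `PATTERNS` are made up for illustration, and it yields one task per match (like the RegexMatcher in the recipe above), which is an assumption about the workflow rather than a requirement. It only uses the standard library, so it can be tried without Prodigy.

```python
import copy
import re

# Same MAC address expression as in the recipe above
MAC_RE = re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}[0-9a-fA-F]{2}")

# Hypothetical pattern list: (compiled regex, label) tuples
PATTERNS = [(MAC_RE, "MACADDRESS")]


def regex_tasks(stream, patterns=PATTERNS):
    """Yield one task per regex match; examples without a match are skipped."""
    for eg in stream:
        for regex, label in patterns:
            for match in regex.finditer(eg["text"]):
                task = copy.deepcopy(eg)  # don't mutate the incoming example
                task["spans"] = [
                    {"start": match.start(), "end": match.end(), "label": label}
                ]
                yield task


if __name__ == "__main__":
    examples = [
        {"text": "Dropping the radius packet for Station "
                 "ac:37:43:4a:d9:78 f0:5c:19:21:ef:90 doing 802.1x"},
        {"text": "no address in this line"},
    ]
    for task in regex_tasks(examples):
        span = task["spans"][0]
        print(span["label"], task["text"][span["start"]:span["end"]])
```

In a recipe, the generator's output would presumably be what goes into the 'stream' entry of the returned components, with no model, no 'update' callback and no sorter.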

The file size shouldn't be a problem if you're loading in data that can be read line-by-line (e.g. a plain text file). But because your recipe uses the prefer_uncertain sorter and a model, it can happen that it has to iterate over a lot of examples before it finds a suitable one to present to you. If you remove all of that logic and just annotate your regex matches, this shouldn't be a problem anymore. (Unless your regular expressions are slow, of course, which doesn't seem to be a problem here.)
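To make the delay more concrete, here is a toy filter. It is not Prodigy's prefer_uncertain implementation, just an assumed stand-in that only lets through examples scored close to 0.5. The function name and the `margin` parameter are hypothetical. It shows how many inputs can be consumed before the first example comes out when the model is confident about most of them.

```python
def toy_uncertainty_filter(scored_stream, margin=0.1):
    """Yield (examples_consumed, example) for scores within `margin` of 0.5.

    Toy stand-in for an uncertainty sorter, not Prodigy's actual algorithm.
    """
    consumed = 0
    for score, eg in scored_stream:
        consumed += 1
        if abs(score - 0.5) <= margin:
            yield consumed, eg


if __name__ == "__main__":
    # Made-up scores: the model is confident about the first four examples
    scored = [(0.02, "a"), (0.97, "b"), (0.01, "c"), (0.99, "d"), (0.55, "e")]
    print(next(toy_uncertainty_filter(scored)))  # (5, 'e')
```

With a large input file and a model that is confident about most lines, the same effect would show up as a pause before the next question appears.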

IP and MAC addresses are pretty non-standard, so it makes sense that the default English tokenization rules don't always handle them perfectly. spaCy's tokenizer does more than just split on whitespace (see here), and the existing rules and exceptions are really more optimised for natural language. But you can customise them, and any custom rules you've added will be serialized with the model automatically.

If you're only annotating character offsets, you can also ignore the tokenization for now and deal with that later. (spaCy v2.1 will also allow the parser to predict "subtokens" and merge several tokens into one, so this could also be an interesting solution to explore in the future.)

Ah, got it. I will do that, but it is also a bit interesting that even though the recipe is partially based on a RegexMatcher entity and is supposed to affect/dictate tokenization, it is not having the desired effect. I did go through the tokenization scheme for spaCy and was under the impression that was how it worked. I am working on the static model now and will keep you posted. Thanks a ton for the amazing help @ines.

@ines,
In the example that you provided, I had to make some modifications to use it in my code, and I noticed something unusual. You used doc when creating a span, and I was getting an error saying it could not create a span from 50-67 for a Doc of length 17. I am just curious as to why len(doc) returns 17 while len(doc.text) returns 114, which is the length of the entire log line I am looking at. If anything, I would have expected any serialized/representative version of an object to be longer than just one of its members. :thinking:

[Abhishek:~/Projects/Python-Projects/Projects/NM-NLP] [NM-NLP] 4s $ python npl.py
print(doc) --> Network Login MAC user 787B8AACADE1 logged in MAC 78:7B:8A:AC:AD:E1 port 24 VLAN(s) "CSCF", authentication Radius
len(doc) --> 17
len(doc.text) --> 114
print(doc.text) --> Network Login MAC user 787B8AACADE1 logged in MAC 78:7B:8A:AC:AD:E1 port 24 VLAN(s) "CSCF", authentication Radius
Traceback (most recent call last):
  File "npl.py", line 185, in <module>
    doc = nlp(text)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/NM-NLP/lib/python3.6/site-packages/spacy/language.py", line 346, in __call__
    doc = proc(doc)
  File "npl.py", line 156, in custom_entity_matcher
    Span(doc, start_token_index, end_token_index, label=doc.vocab.strings[match[1]['spans'][0]['label']])
  File "span.pyx", line 58, in spacy.tokens.span.Span.__cinit__
IndexError: [E035] Error creating span with start 50 and end 67 for Doc of length 17.
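One possible reading of this traceback, offered as a sketch rather than a confirmed diagnosis: len(doc) counts tokens (17) while len(doc.text) counts characters (114), and Span(doc, start, end) expects token indices. The values 50 and 67 are the character offsets of 78:7B:8A:AC:AD:E1 in the text, so they fall outside a 17-token Doc. spaCy's Doc.char_span(start, end, label=...) does this conversion and returns None when the offsets don't line up with token boundaries. The helper below illustrates the same mapping on plain token dicts, using tokens copied from the dump earlier in the thread. `char_span_to_token_span` is a made-up name.

```python
# Tokens copied from the token dump above (trimmed to the relevant range)
TOKENS = [
    {"text": "Station", "start": 31, "end": 38, "id": 5},
    {"text": "ac:37:43:4a", "start": 39, "end": 50, "id": 6},
    {"text": ":", "start": 50, "end": 51, "id": 7},
    {"text": "d9:78", "start": 51, "end": 56, "id": 8},
    {"text": "f0:5c:19:21:ef:90", "start": 57, "end": 74, "id": 9},
    {"text": "doing", "start": 75, "end": 80, "id": 10},
]


def char_span_to_token_span(tokens, char_start, char_end):
    """Map character offsets to (start_token, end_token_exclusive).

    Returns None if the offsets don't align with token boundaries.
    """
    start_token = end_token = None
    for tok in tokens:
        if tok["start"] == char_start:
            start_token = tok["id"]
        if tok["end"] == char_end:
            end_token = tok["id"] + 1  # exclusive end, as in Span(doc, start, end)
    if start_token is None or end_token is None or start_token >= end_token:
        return None
    return start_token, end_token


if __name__ == "__main__":
    print(char_span_to_token_span(TOKENS, 39, 56))  # (6, 9): MAC split into 3 tokens
    print(char_span_to_token_span(TOKENS, 57, 74))  # (9, 10): MAC kept as 1 token
    print(char_span_to_token_span(TOKENS, 40, 56))  # None: start is mid-token
```

The first call also shows why character offsets can be annotated independently of tokenization: the badly split MAC address still maps cleanly onto a three-token span.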