Off-track use of Prodigy/Spacy - Custom Regex Pattern Matching and Modeling

A Doc is a collection of Token objects and its length is the number of tokens in the document (API reference here). If you iterate over a Doc, you’ll get tokens. You can see the tokens by printing [token.text for token in doc]. The doc.text attribute (API reference here) returns the verbatim document text, i.e. all token texts plus whitespace. So your Doc consists of 17 tokens and its text is 114 characters long.

A Span is a slice of the Doc and is created with the token indices (API reference here). Your code fails, because you’re trying to create a span object from character offsets instead of token indices.

Your RegexMatcher only really processes the Doc.text and finds character spans in it. It doesn’t tokenize the text – that happens when the Doc object is created. Prodigy can render character offsets just fine, but if you actually want to turn them into spaCy objects, you eventually need to resolve the characters back to tokens and make sure your tokenization matches the entity spans you want to extract. There are pretty much 3 options:

  1. Update the tokenization rules of your model so it preserves the IP and MAC addresses as single tokens. Not sure how well this works, because they’re pretty special strings and you want to avoid introducing undesired side-effects and wrong tokenization in other places.
  2. If your addresses are split into more than one token but are correctly separated from the rest of the text, use the Matcher to write token-based rules for the fragments. Instead of just the strings and character offsets, this will give you the actual tokens.
  3. Write a function that takes your character offsets and maps them to the existing tokens, if possible. This is essentially what Prodigy’s add_tokens helper does. It’s also the same logic as spaCy’s gold.biluo_tags_from_offsets helper that maps character offsets to token-based tags. This might even be one of the most efficient solutions in your case.
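
For illustration, here is a minimal sketch of option 3 – mapping character offsets back onto the existing tokens. The helper name and the example text/offsets below are made up:

import spacy

nlp = spacy.load("en_core_web_lg")  # or whichever model you're using
doc = nlp("Network Login MAC user 787B8AACADE1 logged in")


def char_offsets_to_token_span(doc, start_char, end_char):
    # Map character offsets back onto existing token boundaries, if possible.
    # Same idea as gold.biluo_tags_from_offsets – returns None on a mismatch.
    start_token = end_token = None
    for token in doc:
        if token.idx == start_char:
            start_token = token.i
        if token.idx + len(token) == end_char:
            end_token = token.i
    if start_token is None or end_token is None:
        return None
    return doc[start_token : end_token + 1]


span = char_offsets_to_token_span(doc, 23, 35)  # covers "787B8AACADE1"
if span is not None:
    print(span.text, span.start, span.end)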

@ines,
That clearly explains why the code is failing. I should have connected 14 to the number of tokens in my text. Overlooked a simple detail. :man_facepalming:

So, for my case, #1 has actually been taken care of. I have not noticed any adverse effects elsewhere, at least for now, and I am content to just get this PoC working. :slightly_smiling_face: #2 was solved when I fixed #1, so I will keep it in mind for later. As you said, #3 seems to be the most efficient way of doing it.

Could you please clarify one thing for me? I was looking into the Spacy API and found Doc.char_span that seems to be doing what you are asking me to do. I am not sure if the documentation is misleading or I am misunderstanding it. Every other place clearly says “The index of the first/last token of the span.”. But this API says “The index of the first/last character of the span.”. Going by the API documentation, this is supposed to do what you just explained in your post, and it is not working, which leads me to believe the documentation actually meant token index and not character index.
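
To make the question concrete, this is the kind of call I mean (the text and offsets are made up for illustration):

doc = nlp("Received disconnect from 103.99.0.122: 14:")
# According to the Doc.char_span docs, these are character offsets:
span = doc.char_span(25, 37, label="IPV4ADDRESS")
print(span)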

@ines,
I have most of the solution you proposed figured out and “almost” working, except for one hitch. Consider this line Network Login MAC user 787B8AACADE1 logged in MAC 78:7B:8A:AC:AD:E1 port 24 VLAN(s) 10.1.1.1, authentication Radius. The tokenization for this log is pretty good thus far, without any sub-tokenization occurring etc. There are three entities here, USER, MACADDRESS and IPADDRESS. Entity recognition via rule matching is working for the first two. The problem with the last one is that Spacy is, for some reason, tokenizing the IPADDRESS with the comma suffix as 10.1.1.1, and the token span and indices are constructed according to this. When I reconstruct my character spans to token indices like gold.biluo_tags_from_offsets does, it throws an error because there is now a mismatch between the expected end character and actual end character.

I have been going through the Spacy tokenization API/usage and been looking at my code. I cannot figure out why this is happening even with a suffix directive for my custom tokenizer.

import re
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$,''')
infix_re = re.compile(r'''[~]''')


def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer
                     )

I do understand that my tokenization requirements are different and the exceptions are varied, but a character listed in my suffix regex is not being stripped off, even though it should be processed according to the Spacy tokenization scheme. Appreciate any clarification on this issue. :slight_smile:

Slightly modified your suffixes to include the , in the character class (between [ and ]) and it now splits the , off correctly:

suffix_re = re.compile(r'''[\]\)"',]$''')
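
To double-check, you can plug the custom tokenizer back in and look at the tokens directly (assuming the custom_tokenizer function from your snippet, with the fixed suffix_re):

import spacy

nlp = spacy.load("en_core_web_lg")
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp("port 24 VLAN(s) 10.1.1.1, authentication Radius")
print([token.text for token in doc])
# "10.1.1.1" and "," should now come out as separate tokens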

@ines,
That fixed the tokenization. I feel I am getting closer to the solution, but keep hitting more problems. This time, it is a Prodigy thing (or at least I think so). I am noticing my log being tagged with a weird “WORK_OF_ART” entity by default. Please bear with me while I detail the training and run scenarios.

Output with base en_core_web_lg:
Nothing has been trained here, and Spacy/Prodigy are trying to identify entities based on their in-built models.

MAC 14 17 ORG
787B8AACADE1 23 35 CARDINAL
MAC 46 49 ORG
24 VLAN(s 73 82 QUANTITY
10.1.1.1 84 92 ORG
Radius 109 115 ORG

Output with model trained with one textcat label and one ner label:

Multiple runs produce different tokenizations and different entity tags, none of which have been trained for, except MACADDRESS. The other labels have just been added to the NER label list.

Network 0 7 WORK_OF_ART
Login 8 13 WORK_OF_ART
MAC 14 17 IPV4ADDRESS
user 18 22 WORK_OF_ART
787B8AACADE1 23 35 WORK_OF_ART
logged 36 42 WORK_OF_ART
in 43 45 WORK_OF_ART
MAC 46 49 WORK_OF_ART
78:7B:8A:AC:AD:E1 50 67 MACADDRESS
port 68 72 WORK_OF_ART
24 73 75 WORK_OF_ART
VLAN(s 76 82 WORK_OF_ART
) 82 83 WORK_OF_ART
10.1.1.1 84 92 WORK_OF_ART
, 92 93 WORK_OF_ART
authentication 94 108 WORK_OF_ART
Radius 109 115 WORK_OF_ART
====================================================================
Network 0 7 WORK_OF_ART
Login 8 13 WORK_OF_ART
MAC user 787B8AACADE1 logged in MAC 78:7B:8A:AC:AD:E1 port 24 VLAN(s 14 82 IPV4ADDRESS
) 82 83 WORK_OF_ART
10.1.1.1 84 92 WORK_OF_ART
, 92 93 WORK_OF_ART
authentication 94 108 WORK_OF_ART
Radius 109 115 WORK_OF_ART
=============================================================================
Network 0 7 PROCESS
Login 8 13 WORK_OF_ART
MAC 14 17 IPV4ADDRESS
user 18 22 WORK_OF_ART
787B8AACADE1 logged 23 42 FILE/RESOURCE
in 43 45 WORK_OF_ART
MAC 78:7B:8A:AC:AD:E1 port 46 72 FILE/RESOURCE
24 73 75 WORK_OF_ART
VLAN(s 76 82 WORK_OF_ART
) 82 83 WORK_OF_ART
10.1.1.1 84 92 WORK_OF_ART
, 92 93 WORK_OF_ART
authentication 94 108 WORK_OF_ART
Radius 109 115 WORK_OF_ART

Because of the problems with the previous run, I thought I would create a model with just one entity label, MACADDRESS, and see if the same issue occurs here. The output of the Prodigy training session:

[Abhishek:~/Projects/Git-Repositories/spaCy] [NM-NLP] master(+25/-6,-1) 4s ± prodigy ner.batch-train mac_learning_dataset en_core_web_lg --output-model /tmp/models/net_labels --eval-split 0.2

Loaded model en_core_web_lg
Using 20% of accept/reject examples (125) for evaluation
Using 100% of remaining examples (501) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.000
Correct    0
Incorrect  88
Entities   310
Unknown    310


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         23.160     44         44         662        0          0.500
02         18.247     66         22         644        0          0.750
03         16.106     79         9          578        0          0.898
04         15.077     82         6          727        0          0.932
05         14.991     80         8          667        0          0.909
06         13.212     78         10         653        0          0.886
07         12.505     79         9          663        0          0.898
08         11.707     80         8          691        0          0.909
09         10.450     79         9          678        0          0.898
10         9.926      78         10         676        0          0.886

Correct    82
Incorrect  6
Baseline   0.000
Accuracy   0.932

Model: /private/tmp/models/net_labels
Training data: /private/tmp/models/net_labels/training.jsonl
Evaluation data: /private/tmp/models/net_labels/evaluation.jsonl

I then proceed to use this fresh model with my code. To be on the safe side, I disabled my custom matcher component and the inconsistent output disappeared, but not the WORK_OF_ART thing. I don’t even know where that is coming from.

Output when model trained with a single new NER entity MACADDRESS (dataset is new, does not contain annotations of any other type, model is new, as in, not overwriting any old data here):

Network 0 7 WORK_OF_ART
Login 8 13 WORK_OF_ART
MAC 14 17 WORK_OF_ART
user 18 22 WORK_OF_ART
78:7B:8A:AC:AD:E1 50 67 MACADDRESS
port 68 72 WORK_OF_ART
24 73 75 WORK_OF_ART
VLAN(s) 76 83 IPV4ADDRESS
10.1.1.1 84 92 WORK_OF_ART
, 92 93 WORK_OF_ART
authentication Radius 94 115 IPV4ADDRESS

The only reason there is an IPV4ADDRESS label in there is that I added it to the NER label list, to see whether something that should not be happening was occurring. At this point, my custom code has been disabled and is not being hit at all.

I don’t know if training a model is messing with the actual NER process and how it recognizes unknowns. Even if that were true, I would expect it to miss complex tokens such as USER, MACADDRESS and such. I am not sure why it is tagging every single token in the document with that label. Did I inadvertently hit a bug in the code?

Tokenization also appears inconsistent (and by inconsistent I mean different runs produce different results, with nothing changed). This is with custom components disabled.

First Run:

Network Login 0 13 IPV4ADDRESS
user 18 22 WORK_OF_ART
78:7B:8A:AC:AD:E1 50 67 MACADDRESS
VLAN(s) 76 83 WORK_OF_ART
10.1.1.1 84 92 QUANTITY
, 92 93 WORK_OF_ART
authentication 94 108 WORK_OF_ART
Radius 109 115 WORK_OF_ART

Second Run:

Network 0 7 WORK_OF_ART
Login 8 13 WORK_OF_ART
MAC 14 17 WORK_OF_ART
user 18 22 WORK_OF_ART
78:7B:8A:AC:AD:E1 50 67 MACADDRESS
VLAN(s) 76 83 WORK_OF_ART
10.1.1.1 84 92 QUANTITY
, 92 93 WORK_OF_ART
authentication 94 108 WORK_OF_ART
Radius 109 115 WORK_OF_ART

Third Run:

Network 0 7 WORK_OF_ART
Login 8 13 WORK_OF_ART
MAC 14 17 WORK_OF_ART
user 18 22 WORK_OF_ART
78:7B:8A:AC:AD:E1 50 67 MACADDRESS
VLAN(s) 76 83 WORK_OF_ART
10.1.1.1 84 92 QUANTITY
, 92 93 WORK_OF_ART
authentication 94 108 IPV4ADDRESS
Radius 109 115 IPV4ADDRESS

@ines,
Any way this could be related to the catastrophic-forgetting problem? I thought that was less likely to occur in Spacy-2.0.0.

Yeah, something like that – it's most likely a side-effect from the weights of the existing pre-trained model. If you search for WORK_OF_ART, you'll see that this is a pretty common problem. That's likely due to the fact that WORK_OF_ART is the rarest category. See Matt's comment here:

v2.1.0 will already ship with various improvements to the entity recognizer that should hopefully prevent some of this. We're also working on bringing more built-in solutions to prevent catastrophic forgetting into the core library. For example, when training a model, you'd then be able to "remind" it of what it previously predicted.
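
In the meantime, a rough sketch of that "reminding" idea (sometimes called pseudo-rehearsal) – raw_texts here is a hypothetical list of your raw log lines:

import spacy

nlp = spacy.load("en_core_web_lg")

# Let the original model annotate some raw text first and keep its own
# predictions as extra "gold" examples.
revision_data = []
for doc in nlp.pipe(raw_texts):
    ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_data.append((doc.text, {"entities": ents}))

# Mixing revision_data into the examples you pass to nlp.update() alongside
# the new MACADDRESS annotations makes it less likely that the update wipes
# out what the model previously predicted.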

@ines,
That was quick. :slight_smile: Thank you for clarifying that. Since that is a really rare category, let me try bootstrapping the model with a few examples and see if that makes the problem go away. Or maybe just try and use a blank Spacy model to do the same.

On a related note, if these changes are part of spacy-nightly, I can try running this off that build. I actually already tried it, but am getting some errors that I was not able to find documentation/APIs for.

[Abhishek:~/Projects/Python-Projects/Projects/NM-NLP] [Spacy-Experimental] 4s $ python npl.py
Traceback (most recent call last):
  File "npl.py", line 183, in <module>
    nlp = spacy.load('/tmp/models/net_labels')
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/Spacy-Experimental/lib/python3.6/site-packages/spacy/__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/Spacy-Experimental/lib/python3.6/site-packages/spacy/util.py", line 119, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/Spacy-Experimental/lib/python3.6/site-packages/spacy/util.py", line 157, in load_model_from_path
    component = nlp.create_pipe(name, config=config)
  File "/Users/Abhishek/Projects/Python-Projects/Python-VEs/Spacy-Experimental/lib/python3.6/site-packages/spacy/language.py", line 243, in create_pipe
    raise KeyError(Errors.E108.format(name=name))
KeyError: "[E108] As of spaCy v2.1, the pipe name `sbd` has been deprecated in favor of the pipe name `sentencizer`, which does the same thing. For example, use `nlp.create_pipeline('sentencizer')`"

Would you suggest that I experiment with the nightly builds and see if they fix my problems? Also, Prodigy somehow does not want to work with the dev versions of Spacy and thinc. Not sure why. Spacy-nightly installs fine, thinc-7 dev installs fine, but then when I use the Prodigy whl file, it uninstalls thinc-7 dev and puts in thinc-6.1.12, thus making Spacy-nightly useless too.

I am open to trying anything to get this working. :slight_smile:

We haven't really tested Prodigy with the current spaCy nightly yet, so I can't promise that this will work.

Looks like Prodigy currently adds the component sbd for rule-based sentence boundary detection to models if they don't have a parser – but as of 2.1.0, that component will only be called sentencizer. I'm not sure if this is also done in the Prodigy internals, or only in the recipes – but check for nlp.create_pipe('sbd') and replace it with nlp.create_pipe('sentencizer').

That's because pip installs the dependencies for the wheel file by default. This means it'll overwrite thinc, and install spacy alongside spacy-nightly (which is a separate package). So you'll end up with 2 global spacys, which will definitely mess up your environment.

You can either install Prodigy first, uninstall spacy (!) and install spacy-nightly, or install the Prodigy wheel with pip install --no-deps to prevent it from installing the dependencies. But like I said, I'm still not sure if it will work – you might run into other issues. So really only do this if you're prepared to spend some time debugging.


@ines,
On it. Trying out all possibilities. Will keep you folks posted. And as always, thanks a zillion. :slight_smile:

@ines,
So, I have been trying out all your suggestions and the rule-based matcher is working for the most part, at least for MACADDRESS. I have annotated examples with Spacy and am verifying them in Prodigy.

I am seeing something happening that I cannot explain, despite my best efforts. The regex rule-based matching for IPADDRESS is not working via Spacy and is all over the place. I have taken apart my code and tested it line by line in the interpreter, including testing the regex separately online to see if there is any problem with the regex itself, but to no avail. Let me try to provide you with screenshots and see if they show what I am talking about.

Python Interpreter Regex Match:

Regex Matcher:

Spacy Rule-Based Annotation:

There is an IPADDRESS right there in the example that it has not annotated. A JSONL example is below.

{"text": "error: Received disconnect from 103.99.0.122: 14: No more user authentication methods available. [preauth]", "spans": [{"start": 46, "end": 49, "text": "14:", "label": "IPV4ADDRESS"}, {"start": 50, "end": 52, "text": "No", "label": "IPV4ADDRESS"}, {"start": 86, "end": 96, "text": "available.", "label": "IPV4ADDRESS"}, {"start": 97, "end": 98, "text": "[", "label": "IPV4ADDRESS"}], "tokens": [{"text": "error:", "start": 0, "end": 6, "id": 0}, {"text": "Received", "start": 7, "end": 15, "id": 1}, {"text": "disconnect", "start": 16, "end": 26, "id": 2}, {"text": "from", "start": 27, "end": 31, "id": 3}, {"text": "103.99.0.122:", "start": 32, "end": 45, "id": 4}, {"text": "14:", "start": 46, "end": 49, "id": 5}, {"text": "No", "start": 50, "end": 52, "id": 6}, {"text": "more", "start": 53, "end": 57, "id": 7}, {"text": "user", "start": 58, "end": 62, "id": 8}, {"text": "authentication", "start": 63, "end": 77, "id": 9}, {"text": "methods", "start": 78, "end": 85, "id": 10}, {"text": "available.", "start": 86, "end": 96, "id": 11}, {"text": "[", "start": 97, "end": 98, "id": 12}, {"text": "preauth", "start": 98, "end": 105, "id": 13}, {"text": "]", "start": 105, "end": 106, "id": 14}]}

There are many more examples like this. Actually, it is extremely unlikely that it got anything right with the IPADDRESS regex. I do not understand why the same regex works outside of the Spacy rule-based matcher, but not inside it. It is also weird that it seems to be NER tagging regular English words with the label, as in the example.

Regex Patterns:

REGEX_PATTERNS = [(URL_PATTERN, 'URL'),
                  (MAC_PATTERN, 'MACADDRESS'),
                  (IPV4_PATTERN, 'IPV4ADDRESS'),
                  (IPV6_PATTERN, 'IPV6ADDRESS'),
                  (PROCESS_PATTERN, 'PROCESS'),
                  (FILE_PATTERN, 'FILE/RESOURCE'),
                  (HYPHENATED_PATTERN, 'HYPHENATED-ENTITY'),
                  (KEY_VALUE_PATTERN, 'KEY-VALUE PAIR')
                  ]
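
The pattern strings themselves are defined elsewhere in my code. Just for illustration, the MAC and IPv4 ones are roughly of this shape (not the exact regexes I use):

# Illustrative only – not the exact patterns from my code.
MAC_PATTERN = r'(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}'
IPV4_PATTERN = r'(?:\d{1,3}\.){3}\d{1,3}'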

Regex-Matcher Code:

import re
from collections import defaultdict


class RegexMatcher(object):

    def __init__(self, expression, label):
        self.regex_patterns = defaultdict()
        self.regex_patterns[label] = re.compile(expression, re.UNICODE)

    def __call__(self, document):
        task = {}
        for label, expression in self.regex_patterns.items():
            for match in re.finditer(expression, document.text):  # find match in example text
                # task = copy.deepcopy(eg)  # match found – copy the example
                start, end = match.span()
                # get matched indices
                task["spans"] = [{"start": start, "end": end, "text": match.group(), "label": label}]  # label match
                yield 0.5, task  # (score, example) tuples

    def add_regex_patterns(self, expression, label):
        self.regex_patterns[label] = re.compile(expression, re.UNICODE)

    def update(self, examples):
        # this is normally used for updating the model, but we're just
        # going to do nothing here and return 0, which will be added to
        # the loss returned by the model's update() method
        return 0

    def get_regex_patterns(self):
        return self.regex_patterns

Rule-Based Matcher Code:

from spacy.tokens import Span


def convert_char_span_to_token_idx(doc, entities=None):
    # Look up token indices by their start and end character offsets
    token_starts = {token.idx: token.i for token in doc}
    token_ends = {token.idx+len(token): token.i for token in doc}

    for start_char, end_char, label in entities:
        # Get the token start and end indices, which in our case should be the same,
        # since we normally do not have multi-token spans
        token_start_index = token_starts.get(start_char)
        token_end_index = token_ends.get(end_char)

        if token_start_index is not None and token_end_index is not None:
            if token_start_index == token_end_index:
                # We have a single token that matches
                return token_start_index, token_end_index
            # TODO - handle multi-token spans later

    # The character offsets did not line up with token boundaries
    # (or span multiple tokens), so signal that to the caller
    return None, None


def custom_entity_matcher(doc):
    # This function will be run automatically when you call nlp
    # on a string of text. It receives the doc object and lets
    # you write to it – e.g. to the doc.ents or a custom attribute

    regex_matcher = RegexMatcher(REGEX_PATTERNS[1][0], REGEX_PATTERNS[1][1])
    regex_matcher.add_regex_patterns(REGEX_PATTERNS[2][0], REGEX_PATTERNS[2][1])
    regex_matches = regex_matcher(doc)
    for match in regex_matches:
        span_info = match[1]['spans'][0]
        entity = (span_info['start'], span_info['end'], span_info['label'])
        token_start, token_end = convert_char_span_to_token_idx(doc, [entity])
        if token_start is None or token_end is None:
            continue
        # Create a new Span object from the doc, the start token index
        # and the end token index (Span end indices are exclusive, so a
        # single-token span runs from token_start to token_end + 1)
        span = Span(doc, token_start, token_end + 1,
                    label=doc.vocab.strings[span_info['label']])
        # Overwrite the doc.ents and add your new entity span
        doc.ents = list(doc.ents) + [span]
    return doc
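
For reference, the component is wired into the pipeline roughly like this (simplified from my actual script, so treat the details as approximate):

import spacy

nlp = spacy.load("en_core_web_lg")
nlp.tokenizer = custom_tokenizer(nlp)
nlp.add_pipe(custom_entity_matcher, name="regex_entity_matcher", last=True)

doc = nlp("Network Login MAC user 787B8AACADE1 logged in MAC 78:7B:8A:AC:AD:E1 port 24 VLAN(s) 10.1.1.1, authentication Radius")
print([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])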

I am just not able to figure out if this is a general regex issue (which it does not seem to be, by my external tests) or a Spacy rule-matcher issue. Any inputs are greatly appreciated. Till then, I can at least train it on the other label.

@ines,
Any inputs on this issue though? Haven’t been able to solve this. Or even figure out why this is happening.

Sorry, I didn't have time to run your code in detail yet. Your regular expressions seem fine – but in your recipe, you're using the regex matcher together with a model in the loop, right? If so, it looks like the example in your screenshot might be one that was suggested by the model, not by your patterns.

(A general tip for debugging: You probably want your regex matcher to write to the task's meta and include that it was produced by the matcher – for example, "meta": {"regex": true}, or even the name of the pattern that was matched. This makes it easier to see what's going on.)
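
Something along these lines, just as a sketch of the idea (the function name is made up):

import re

def regex_tasks(text, expression, label):
    # Attach meta info to each task, so you can tell in the UI and in the
    # saved dataset that a span came from the regex matcher, and which pattern.
    for match in re.finditer(expression, text):
        start, end = match.span()
        yield {
            "text": text,
            "spans": [{"start": start, "end": end, "label": label}],
            "meta": {"regex": True, "pattern": label},
        }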

Also, another thing I just noticed:

Here, you're setting up a blank task but you only ever add "spans" to it and no "text". This means that Prodigy won't be able to render it.

@ines,
Thank you for clarifying that. So a couple of follow-up questions then:

  1. The screenshot that you see is me using prodigy.mark to evaluate my choices. I am not sure if the model is in the loop at that point, since there is no scoring involved. I was trying to follow your earlier suggestion and also found the mark recipe on another post on the forum. This was your previous suggestion:

You can then stream in the data and accept/reject whether the entity produced by your rules is correct. Based on those annotations, you can calculate the percentage of correctly matched spans. As you change your rules and regular expressions, you can re-run the same evaluation with the same data, and compare the results.

Did I interpret this suggestion correctly and use prodigy.mark or did you want me to keep the model in the loop?

  2. I will add in the meta information now.

As for your last point, I am just passing the spans back to the other method here. Here is my actual code where I write the evaluation examples:

tokens = [{"text": token.text, "start": token.idx,
           "end": (token.idx + len(token)),
           "id": token.i}
          for token in file_doc]

spans = [{"start": ent.start_char, "end": ent.end_char,
          "text": ent.text, "label": ent.label_} for ent in file_doc.ents
         if ent.label_ in ('IPV4ADDRESS', 'MACADDRESS')]

for span in spans:
    # Here, we want to create one example per span, so you
    # can evaluate each entity separately
    example = {"text": file_doc.text, "spans": [span], "tokens": tokens}
    evaluation_examples.append(example)

Is this what you wanted me to do?

Sample JSONL entry:

{"text": "Authentication failed for Network Login MAC user AC1F6B2FD18F Mac AC:1F:6B:2F:D1:8F port 43", "spans": [{"start": 66, "end": 83, "text": "AC:1F:6B:2F:D1:8F", "label": "MACADDRESS"}], "tokens": [{"text": "Authentication", "start": 0, "end": 14, "id": 0}, {"text": "failed", "start": 15, "end": 21, "id": 1}, {"text": "for", "start": 22, "end": 25, "id": 2}, {"text": "Network", "start": 26, "end": 33, "id": 3}, {"text": "Login", "start": 34, "end": 39, "id": 4}, {"text": "MAC", "start": 40, "end": 43, "id": 5}, {"text": "user", "start": 44, "end": 48, "id": 6}, {"text": "AC1F6B2FD18F", "start": 49, "end": 61, "id": 7}, {"text": "Mac", "start": 62, "end": 65, "id": 8}, {"text": "AC:1F:6B:2F:D1:8F", "start": 66, "end": 83, "id": 9}, {"text": "port", "start": 84, "end": 88, "id": 10}, {"text": "43", "start": 89, "end": 91, "id": 11}]}

Sorry, I was getting a bit confused by all of that nested logic and the pattern matcher that yields examples etc. And yes, if you want to create static annotation examples (one for each span) and render them exactly as they come in to accept/reject (e.g. with the mark recipe), you’d really only need something like this:

import re

examples = []
for text in LOTS_OF_TEXTS:
    for label, expression in regex_patterns.items():
        for match in re.finditer(expression, text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            task = {"text": text, "spans": [span]}
            examples.append(task)
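
You could then dump those examples to a JSONL file and stream them into the mark recipe:

import json

# Write one task per line – this file can then be used as the source for the
# mark recipe, e.g. something like: prodigy mark my_dataset examples.jsonl --view-id ner
with open("examples.jsonl", "w", encoding="utf8") as f:
    for task in examples:
        f.write(json.dumps(task) + "\n")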