Adding custom attribute to doc, having NER use attribute

I have made a script that auto-rejects samples that do not match a pattern from a pattern file while annotating. This works okay for some entities, such as zip codes (I have other 5-digit numbers that aren’t zip codes), but ultimately it relies on giving the model lots of rejected examples of the entity. For other entities the model will have a much harder time learning to reject a token if it doesn’t match the pattern, for instance if our entity has to match one of a large list of terms.

What would be much better is if our model knew whether or not a given span matches a pattern! To that end I have created a custom component that adds an attribute signifying whether the span matches the pattern.

First Question) Will the NER “know” about my custom attributes? That is, will they be encoded in the tensor as input?

Second Question) When loading the model using ner.batch-train I get the following error:
KeyError: “Can’t find factory for ‘pattern_detector’.”
I read your comment in [Load error after adding custom textcat model to the pipeline], but I don’t understand where/how to add the factory.


import json

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span, Token


class Pattern_Matcher(object):
    def __init__(self, nlp, label):
        self.vocab = nlp.vocab
        self.entityname = label
        self.label = nlp.vocab.strings[self.entityname]
        self.matcher = Matcher(nlp.vocab)
        self.name = "pattern_detector"
        self.nlp = nlp
        self.fill_matcher_w_patterns()
        Token.set_extension('is_' + self.entityname, default=False)

    def fill_matcher_w_patterns(self):
        pattern_path = '/data/prodigy/patterns/' + self.entityname + '_pattern.jsonl'
        with open(pattern_path, 'r') as f:
            for line in f:
                entry = json.loads(line)
                self.matcher.add(entry['label'], None, entry['pattern'])
        print('done adding patterns to matcher')

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for _, start, end in matches:
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Flag every token in the matched span via the custom attribute.
            for token in entity:
                token._.set('is_' + self.entityname, True)
        # Merge each matched span into a single token.
        for span in spans:
            span.merge()
        return doc


def save_model(label):
    nlp = spacy.load('en_core_web_lg')
    component = Pattern_Matcher(nlp, label)  # initialise component
    nlp.add_pipe(component, first=True)
    nlp.factories["pattern_detector"] = lambda nlp, **cfg: Pattern_Matcher(nlp, label, **cfg)
    print('Pipeline', nlp.pipe_names)
    nlp.to_disk('/data/prodigy/models/zip_me')

Out of the box, no, it won't. The NER model (and other spaCy models) make use of the norm, prefix, suffix and shape lexical attributes. When you load a model, the vocabulary loads pre-computed values for these features for most common words. For other words, the feature functions in nlp.vocab.lex_attr_getters are used to compute the features.
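To get a feel for these features, you can inspect them on a token directly. A quick illustrative sketch (the exact values depend on your spaCy version):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'90210 is a zip code')
token = doc[0]
# The statistical models see these lexical features, not your custom attributes:
print(token.norm_, token.prefix_, token.suffix_, token.shape_)
# e.g. '90210' '9' '210' 'dddd' (long runs of the same character type are truncated)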

You could try hijacking one of these features to see if it helps. For instance, you could stick your pattern matching onto the shape feature as follows:

from spacy.attrs import SHAPE

get_shape = nlp.vocab.lex_attr_getters[SHAPE]
# matches_pattern() is a placeholder for your own string-level check.
nlp.vocab.lex_attr_getters[SHAPE] = lambda string: get_shape(string) + matches_pattern(string)
for lex in nlp.vocab:
    # Update any cached values.
    lex.shape_ = nlp.vocab.lex_attr_getters[SHAPE](lex.orth_)

Note that if you do hack the features this way, the pre-trained tagger, parser and other models won't behave properly. If you do find the feature useful, you can subclass the NamedEntityRecognizer class and change its Model() classmethod, which builds the network. But that will be a bit of effort, so it seems better to check that the feature works first.

Can you please respond to this question first? No matter what else I do, it won't matter if I can't get this bit working.
I think I just need more explicit instructions: how/where do I set Language.factories, and does my class need a to_disk method?

When spaCy loads the model, it checks the pipeline and will initialise the individual components by calling Language.create_pipe, which looks up the respective factory. You should be able to simply write to the factories attribute, which is a dictionary:

Language.factories['pattern_detector'] = lambda nlp, **cfg: PatternMatcher(nlp, **cfg) 

The easiest way to ship custom code with your model is to package it as a Python package, using the spacy package command. The model's __init__.py and load() method can execute any code, and also include your custom pipeline components or any other arbitrary data. (If your component depends on other libraries, you can even specify those in your model package's requirements).

Btw, when you call spacy.load with keyword arguments, all of those will be passed through to your model's load method. So if you allow **cfg parameters on your custom component, you can load your model like this:

nlp = spacy.load('your_model', label='YOUR_LABEL')

You could even add more arguments, like the data path, so you don't have to hard-code any of this into your model.
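To make that concrete, here's a minimal sketch of how those keyword arguments reach the component (the label and patterns_path names are just examples):

class PatternMatcher(object):
    def __init__(self, nlp, **cfg):
        # Receives whatever was passed to spacy.load, e.g.
        # nlp = spacy.load('your_model', label='YOUR_LABEL')
        self.label = cfg.get('label', 'ZIPCODE')
        self.patterns_path = cfg.get('patterns_path')  # optional data path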

It shouldn't need one – at least, nlp.to_disk should only call a pipeline component's to_disk method if it exists. But if you do want your component to save data, you can add a simple method that takes the path and saves out data to that location. For example, something like this:

def to_disk(self, path):
    patterns_path = path / 'patterns.json'
    patterns = json.dumps(self.patterns)
    with patterns_path.open('w', encoding='utf8') as f:
        f.write(patterns)

For more details, see the API docs of the abstract base class Pipe. It also includes examples of the methods you can implement to make your component trainable (although, I guess this should be less relevant in your case).

No luck so far :slightly_frowning_face:
I successfully made and loaded a custom language model package, but I still get the same error:
KeyError: “Can’t find factory for ‘pattern_detector’.”

Could you please send me a link to a simple example of how to integrate a custom factory into a spaCy model and use it?

Thanks!

Thanks for updating and sorry if this has been frustrating. I actually felt inspired by this discussion and built a little example model – see below for the full code :blush:

One thing that’s important to keep in mind with this approach is that you do need to package your model, install it via pip and load it in from the package, rather than loading it in from a path. Otherwise, your custom code in the model’s __init__.py won’t be executed. (If you’re loading from a path, spaCy will only refer to the model’s meta.json and not actually run the package.)

python -m spacy package /your_model /tmp
cd /tmp/your_model-0.0.0
python setup.py sdist
pip install dist/your_model-0.0.0.tar.gz

Code example

My code assumes that your model data directory contains an entity_matcher directory with a patterns.json file. In my example, I’m using a JSON file with an object keyed by entity label, e.g. {"GPE": [...]}. It also adds a custom via_patterns attribute to the spans that lets you see whether an entity was added via the matcher when you use the model in spaCy. This is just a little gimmick for example purposes – so you can leave it out if you don’t need it.
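For instance, a hypothetical patterns.json for the zip code case could look like this – note that the value for each label is a list of patterns, and each pattern is itself a list of token descriptions:

{
    "ZIPCODE": [
        [{"IS_DIGIT": true, "LENGTH": 5}]
    ]
}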

I’ve just used one of the default spaCy models, added "entity_matcher" to the pipeline in the meta.json, and used the following for the model package’s __init__.py:

# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import Matcher
import ujson


__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
    return load_model_from_init_py(__file__, **overrides)


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, **cfg):
        Span.set_extension('via_patterns', default=False)
        self.filename = 'patterns.json'
        self.patterns = {}
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            span._.via_patterns = True
            spans.append(span)
        doc.ents = list(doc.ents) + spans
        return doc

    def from_disk(self, path, **cfg):
        patterns_path = path / self.filename
        with patterns_path.open('r', encoding='utf8') as f:
            self.from_bytes(f)
        return self

    def to_disk(self, path):
        patterns = self.to_bytes()
        patterns_path = Path(path) / self.filename
        patterns_path.open('w', encoding='utf8').write(patterns)

    def from_bytes(self, bytes_data):
        self.patterns = ujson.load(bytes_data)
        for label, patterns in self.patterns.items():
            self.matcher.add(label, None, *patterns)
        return self

    def to_bytes(self, **cfg):
        return ujson.dumps(self.patterns, indent=2, ensure_ascii=False)

You can also modify the code to take a path to a patterns file, instead of loading the patterns from the model data. This depends on whether you want to ship your patterns with the model, or swap them out. (Shipping your data with the model can be nice if you intend to share it with others – so you can send the .tar.gz model to someone else on your team, and they’ll be able to just pip install and use it straight away.)

As I mentioned above, the **cfg settings are passed down to the component from spacy.load, so instead of the from_disk and from_bytes methods, you can also just get the path or a list of patterns from the config parameters:

patterns_path = cfg.get('patterns_path')  # get the path and then read it in
patterns = cfg.get('patterns', [])  # get a list of patterns
nlp = spacy.load('your_model', patterns_path='patterns.json')
nlp = spacy.load('your_model', patterns=[{'LOWER': 'foo'}])

Hope this helps!

Closer! I followed your code/file structure and patterns.json.
I made a small change from:

self.matcher.add(label, None, *patterns)

to:

self.matcher.add(label, None, patterns)

I added the custom component first in the pipeline. My base model is en_core_web_lg. The model now loads, but when I hand it a string, as in:

doc = nlp(u'54354 is a zip')

I get the following error:

Segmentation fault (core dumped)

When I load the model with the NER disabled, everything works as expected.

Ah okay – I think my patterns file included a list of separate patterns for the same label, so I made the component add them all to one match pattern. But you can obviously solve this however you like.

A segfault should never happen, so there’s at least an unhandled error somewhere. One likely explanation could be that something goes wrong when the entity recognizer encounters the already set entities. This should work – but there might be certain edge cases where it fails. (If I remember correctly, there’s currently an open issue about a similar problem on the spaCy tracker, which we weren’t able to reproduce. But your example here is actually very nice and isolated – so if you’re able to share an example, we might be able to get to the bottom of this bug :muscle:)

One thing you could try in the meantime is to add the component last, i.e. after the entity recognizer. Before modifying the doc.ents in your component’s __call__ method, you could also check whether it already includes an entity for that exact span (or an overlapping span) and filter that out. In your case, this should be fairly easy – I’m pretty sure your zip codes will be recognised as separate ORDINAL entities.
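A rough sketch of that check, building on the EntityMatcher above (the simple token-index filter here is just one way to do it):

def __call__(self, doc):
    matches = self.matcher(doc)
    spans = []
    for match_id, start, end in matches:
        span = Span(doc, start, end, label=match_id)
        span._.via_patterns = True
        spans.append(span)
    # Collect the token indices covered by the pattern matches...
    matched_tokens = set()
    for span in spans:
        matched_tokens.update(range(span.start, span.end))
    # ...and drop any existing entities that overlap them.
    ents = [ent for ent in doc.ents
            if not matched_tokens.intersection(range(ent.start, ent.end))]
    doc.ents = ents + spans
    return doc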

I don’t think there is anything to share – it’s all the code you provided! The pattern file is just:

{ "ZIPCODE": [ {"IS_DIGIT":true ,"LENGTH":5} ] }

I also tried the same thing with another pattern that just checks whether the text is a particular string, and the same behavior occurred.

When I feed in text that does not match the pattern, as in:

doc = nlp('This text does not contain a zipcode')

the segmentation error does not occur.

Another thing:
When I run ner.batch-train with the label ZIPCODE I do not get the error … the model trains as usual!
Unfortunately, as Honnibal pointed out, the NER does not use custom attributes, so there is no improvement in performance. As suggested, I will try hijacking the shape_ attribute of the vocab.

Unfortunately, this causes a stack overflow error: matches_pattern(string) has to use a Matcher object, which in turn uses the lex_attr_getter function for the shape_ of its input.
Another issue is that changing the shape in the vocab will only be effective if the terms I want to change happen to be in the vocabulary – in the case of 5-digit numbers this is going to be infrequent.

I ended up using:

import numpy as np

# Add every possible 5-digit string to the vocab with a custom shape.
for i in range(100000):
    new_zip = u"{0:0=5d}".format(i)
    nlp.vocab.set_vector(new_zip, np.zeros(300))
    nlp.vocab[new_zip].shape_ = u'IMAZIPCODE'

You could do something similar if you are looking to only match terms from your patterns file.
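As a rough sketch of that variant, assuming the JSONL pattern file format from earlier and simple one-token patterns with a LOWER attribute:

import json

with open('/data/prodigy/patterns/zipcode_pattern.jsonl') as f:
    for line in f:
        pattern = json.loads(line)['pattern']
        # Only handle simple one-token patterns like [{'LOWER': 'foo'}].
        if len(pattern) == 1 and 'LOWER' in pattern[0]:
            term = pattern[0]['LOWER']
            nlp.vocab[term].shape_ = u'IMAZIPCODE'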

The lex_attr_getter functions are called on each new word, if it's not found in the vocab. So if you set your shape-setting function in vocab.lex_attr_getters[SHAPE] it should work.
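For example, here's a sketch that avoids the recursion you hit by testing the raw string directly instead of running a Matcher inside the getter:

from spacy.attrs import SHAPE

get_shape = nlp.vocab.lex_attr_getters[SHAPE]

def shape_with_zip(string):
    # Plain string test – no Matcher call, so the getter can't recurse.
    if len(string) == 5 and string.isdigit():
        return u'IMAZIPCODE'
    return get_shape(string)

nlp.vocab.lex_attr_getters[SHAPE] = shape_with_zip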

I'm really confused by this segfault. I can't see what would be going on --- the following works fine for me:

>>> from spacy.matcher import Matcher
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp(u'90210 is a zip code')
>>> m = Matcher(nlp.vocab)
>>> m.add('ZIPCODE', None, [{"IS_DIGIT": True, "LENGTH": 5}])
>>> m(doc)
[(12564065238629231850, 0, 1)]

So there must be something else going on. I normally debug segfaults by commenting out parts of the program until it runs successfully, and then commenting back in. It's sort of a brutal approach, but usually the binary search only takes a few tries.

So does it run if you don't do this bit?

If that runs, I think the problem is in the merging. It's hard to make sure that the token offsets stay valid after merging spans, especially if we have some spans overlapping. Try this:

def __call__(self, doc):
    matches = self.matcher(doc)
    spans = []
    for _, start, end in matches:
        entity = Span(doc, start, end, label=self.label)
        # Store character offsets, which stay valid across merges.
        spans.append((entity.start_char, entity.end_char))
        for token in entity:
            token._.set('is_' + self.entityname, True)
    for start_char, end_char in spans:
        doc.merge(start_char, end_char, ent_type=self.label)
    return doc

This is the other possible explanation: instead of the merging, we might be getting an error from the NER running over the text.

Actually, a thought. I wonder whether the .merge() method is failing to set the .ent_iob attribute correctly when we supply an entity type during merging. So maybe it's the combination. I'll check.

Update: Yes, that seems to be the case. :tada:
Happy to have gotten to the bottom of this --- it's been a problem for spaCy for a while. It's hard to debug because it's the combination merge + set entity type + apply NER + conflicting prediction.