adding custom attribute to doc, having NER use attribute

Thanks for updating and sorry if this has been frustrating. I actually felt inspired by this discussion and built a little example model – see below for the full code :blush:

One thing that’s important to keep in mind with this approach is that you do need to package your model, install it via pip and load it from the installed package, rather than from a path. Otherwise, your custom code in the model’s __init__.py won’t be executed. (If you load from a path, spaCy only reads the model’s meta.json and never actually imports the package.)

python -m spacy package /your_model /tmp
cd /tmp/your_model-0.0.0
python setup.py sdist
pip install dist/your_model-0.0.0.tar.gz

Code example

My code assumes that your model data directory contains an entity_matcher directory with a patterns.json file. In my example, I’m using a JSON file with an object keyed by entity label, e.g. {"GPE": [...]}. It also adds a custom via_patterns attribute to the spans that lets you see whether an entity was added via the matcher when you use the model in spaCy. This is just a little gimmick for example purposes – so you can leave it out if you don’t need it.
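For illustration, such a patterns.json could look like this – each label maps to a list of Matcher token patterns (the labels and tokens here are just made up):

```json
{
    "GPE": [
        [{"LOWER": "alderaan"}],
        [{"LOWER": "new"}, {"LOWER": "alderaan"}]
    ],
    "ORG": [
        [{"LOWER": "rebel"}, {"LOWER": "alliance"}]
    ]
}
```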

I’ve just used one of the default spaCy models, added "entity_matcher" to the pipeline in the meta.json, and used the following for the model package’s __init__.py:

# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import Matcher
import ujson


__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
    return load_model_from_init_py(__file__, **overrides)


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, **cfg):
        # only register the extension if it's not already set, so the model
        # can be loaded more than once in the same process
        if not Span.has_extension('via_patterns'):
            Span.set_extension('via_patterns', default=False)
        self.filename = 'patterns.json'
        self.patterns = {}
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            span._.via_patterns = True
            spans.append(span)
        doc.ents = list(doc.ents) + spans
        return doc

    def from_disk(self, path, **cfg):
        patterns_path = Path(path) / self.filename
        with patterns_path.open('r', encoding='utf8') as f:
            self.from_bytes(f.read())
        return self

    def to_disk(self, path):
        patterns = self.to_bytes()
        patterns_path = Path(path) / self.filename
        with patterns_path.open('w', encoding='utf8') as f:
            f.write(patterns)

    def from_bytes(self, bytes_data):
        self.patterns = ujson.loads(bytes_data)
        for label, patterns in self.patterns.items():
            self.matcher.add(label, None, *patterns)
        return self

    def to_bytes(self, **cfg):
        return ujson.dumps(self.patterns, indent=2, ensure_ascii=False)
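To sanity-check the serialization format, here’s a quick round-trip of the to_bytes / from_bytes logic using the standard library’s json (ujson behaves the same for this data – the pattern data below is made up):

```python
import json

# made-up pattern data, keyed by entity label as the component expects
patterns = {'GPE': [[{'LOWER': 'alderaan'}], [{'LOWER': 'hoth'}]]}

# to_bytes produces a JSON string...
serialized = json.dumps(patterns, indent=2, ensure_ascii=False)

# ...and from_bytes parses that same string back into the patterns dict
assert json.loads(serialized) == patterns
```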

You can also modify the code to take a path to a patterns file, instead of loading the patterns from the model data. This depends on whether you want to ship your patterns with the model, or swap them out. (Shipping your data with the model can be nice if you intend to share it with others – so you can send the .tar.gz model to someone else on your team, and they’ll be able to just pip install and use it straight away.)

As I mentioned above, the **cfg settings are passed down to the component from spacy.load, so instead of the from_disk and from_bytes methods, you can also just get the path or a list of patterns from the config parameters:

patterns_path = cfg.get('patterns_path')  # get the path and then read it in
patterns = cfg.get('patterns', [])  # get a list of patterns
nlp = spacy.load('your_model', patterns_path='patterns.json')
nlp = spacy.load('your_model', patterns=[{'LOWER': 'foo'}])
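So a variant of the component that reads its patterns from the config could start out roughly like this (a minimal sketch, not the full component – the default and example patterns are made up):

```python
class EntityMatcher(object):
    name = 'entity_matcher'

    # sketch: take the patterns straight from **cfg instead of from_disk
    def __init__(self, nlp=None, **cfg):
        # fall back to an empty dict if no patterns are passed in
        self.patterns = cfg.get('patterns', {})

matcher = EntityMatcher(patterns={'GPE': [[{'LOWER': 'foo'}]]})
```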

Hope this helps!