adding custom attribute to doc, having NER use attribute

ines · March 5, 2018, 3:54pm

Thanks for updating and sorry if this has been frustrating. I actually felt inspired by this discussion and built a little example model – see below for the full code.

One thing that’s important to keep in mind with this approach is that you do need to package your model, install it via pip and load it in from the package, rather than loading it in from a path. Otherwise, your custom code in the model’s __init__.py won’t be executed. (If you’re loading from a path, spaCy will only refer to the model’s meta.json and not actually run the package.)

python -m spacy package /your_model /tmp
cd /tmp/your_model-0.0.0
python setup.py sdist
pip install dist/your_model-0.0.0.tar.gz

Code example

My code assumes that your model data directory contains an entity_matcher directory with a patterns.json file. In my example, I’m using a JSON file with an object keyed by entity label, e.g. {"GPE": [...]}. It also adds a custom via_patterns attribute to the spans that lets you see whether an entity was added via the matcher when you use the model in spaCy. This is just a little gimmick for example purposes – so you can leave it out if you don’t need it.
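To make the file layout a bit more concrete, here's a rough sketch of what such a patterns.json could contain and how it round-trips through JSON. The labels and token patterns are made up for illustration, and I'm using the standard library's json module instead of ujson so the snippet runs without extra dependencies. Each label maps to a list of patterns, and each pattern is a list of token dicts.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example patterns, keyed by entity label.
# Each pattern is a list of token dicts.
example_patterns = {
    "GPE": [
        [{"LOWER": "new"}, {"LOWER": "york"}],
        [{"LOWER": "berlin"}],
    ],
    "ORG": [
        [{"LOWER": "acme"}, {"LOWER": "corp"}],
    ],
}


def write_patterns(model_dir, patterns):
    """Write patterns to <model_dir>/entity_matcher/patterns.json."""
    matcher_dir = Path(model_dir) / "entity_matcher"
    matcher_dir.mkdir(parents=True, exist_ok=True)
    patterns_path = matcher_dir / "patterns.json"
    with patterns_path.open("w", encoding="utf8") as f:
        json.dump(patterns, f, indent=2, ensure_ascii=False)
    return patterns_path


def read_patterns(patterns_path):
    """Read the patterns back in as a dict keyed by label."""
    with Path(patterns_path).open("r", encoding="utf8") as f:
        return json.load(f)


# A temporary directory stands in for the model data directory here
with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_patterns(tmp_dir, example_patterns)
    loaded = read_patterns(path)
```

In a real model package, you'd point write_patterns at the model data directory instead of a temporary one.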

I’ve just used one of the default spaCy models, added "entity_matcher" to the pipeline in the meta.json, and used the following for the model package’s __init__.py:

# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import Matcher
import ujson


__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
    return load_model_from_init_py(__file__, **overrides)


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, **cfg):
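        # Only register the extension once, in case the component is created again
        if not Span.has_extension('via_patterns'):
            Span.set_extension('via_patterns', default=False)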
        self.filename = 'patterns.json'
        self.patterns = {}
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            span._.via_patterns = True
            spans.append(span)
        doc.ents = list(doc.ents) + spans
        return doc

    def from_disk(self, path, **cfg):
        patterns_path = path / self.filename
        with patterns_path.open('r', encoding='utf8') as f:
            self.from_bytes(f)
        return self

    def to_disk(self, path):
        patterns = self.to_bytes()
        patterns_path = Path(path) / self.filename
        # Use a context manager so the file handle is closed after writing
        with patterns_path.open('w', encoding='utf8') as f:
            f.write(patterns)

    def from_bytes(self, bytes_data):
        # Despite the name, this receives the open file handle from from_disk
        self.patterns = ujson.load(bytes_data)
        for label, patterns in self.patterns.items():
            self.matcher.add(label, None, *patterns)
        return self

    def to_bytes(self, **cfg):
        return ujson.dumps(self.patterns, indent=2, ensure_ascii=False)
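One thing the component above doesn't handle is a match that overlaps with an entity already in doc.ents. Depending on your spaCy version, assigning overlapping spans to doc.ents may not be accepted, so you might want to filter the matches first. Here's a minimal sketch of that idea in plain Python, using (start, end) token offsets with an exclusive end. The function name filter_overlapping is my own and not part of spaCy. In __call__, you'd build the existing offsets from (ent.start, ent.end) for each entity in doc.ents.

```python
def filter_overlapping(existing, candidates):
    """Keep candidate (start, end) spans that don't overlap with the
    existing spans or with a candidate that was already kept.

    Candidates are checked in order of start offset, longer spans first.
    """
    kept = []
    taken = list(existing)
    ordered = sorted(candidates, key=lambda s: (s[0], -(s[1] - s[0])))
    for start, end in ordered:
        # Two half-open ranges overlap if each starts before the other ends
        overlaps = any(start < t_end and t_start < end for t_start, t_end in taken)
        if not overlaps:
            kept.append((start, end))
            taken.append((start, end))
    return kept


# Hypothetical offsets: entities the NER already found, plus matcher results
existing_ents = [(0, 2), (5, 6)]
matches = [(1, 3), (3, 5), (4, 5), (6, 8)]
non_overlapping = filter_overlapping(existing_ents, matches)
```

This version always prefers the existing entities over the matches. If you want your patterns to win instead, you'd filter the existing entities against the matches.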

You can also modify the code to take a path to a patterns file, instead of loading the patterns from the model data. This depends on whether you want to ship your patterns with the model, or swap them out. (Shipping your data with the model can be nice if you intend to share it with others – so you can send the .tar.gz model to someone else on your team, and they’ll be able to just pip install and use it straight away.)

As I mentioned above, the **cfg settings are passed down to the component from spacy.load, so instead of the from_disk and from_bytes methods, you can also just get the path or a list of patterns from the config parameters:

patterns_path = cfg.get('patterns_path')  # get the path and then read it in
patterns = cfg.get('patterns', [])  # get a list of patterns

nlp = spacy.load('your_model', patterns_path='patterns.json')
nlp = spacy.load('your_model', patterns=[{'LOWER': 'foo'}])
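Inside the component, the config handling could look roughly like the sketch below. The helper name resolve_patterns and the fallback label "CUSTOM" are my own assumptions and not part of spaCy. It's plain Python and only shows the lookup logic: a patterns_path setting wins if it's set and is expected to point to a JSON file keyed by label, otherwise a patterns list from the config is filed under the fallback label.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_LABEL = "CUSTOM"  # hypothetical fallback label, not a spaCy constant


def resolve_patterns(cfg, default_label=DEFAULT_LABEL):
    """Return a dict of patterns keyed by label, based on the config.

    'patterns_path' takes precedence and should point to a JSON file keyed
    by label. Otherwise, a 'patterns' list is stored under the fallback
    label. With neither setting, the result is an empty dict.
    """
    patterns_path = cfg.get("patterns_path")
    if patterns_path is not None:
        with Path(patterns_path).open("r", encoding="utf8") as f:
            return json.load(f)
    patterns = cfg.get("patterns", [])
    return {default_label: list(patterns)} if patterns else {}


# Case 1: patterns passed in directly as a list
from_list = resolve_patterns({"patterns": [[{"LOWER": "foo"}]]})

# Case 2: patterns read from a file (a temporary file as a stand-in)
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir) / "patterns.json"
    tmp_path.write_text(json.dumps({"GPE": [[{"LOWER": "berlin"}]]}), encoding="utf8")
    from_file = resolve_patterns({"patterns_path": str(tmp_path)})

# Case 3: nothing configured
empty = resolve_patterns({})
```

The component's __init__ could then call something like this with its **cfg and add the resulting patterns to the matcher, the same way from_bytes does.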

Hope this helps!
