Thanks for updating, and sorry if this has been frustrating. I actually felt inspired by this discussion and built a little example model – see below for the full code.
One thing that’s important to keep in mind with this approach is that you do need to package your model, install it via pip and load it from the package, rather than loading it from a path. Otherwise, your custom code in the model’s `__init__.py` won’t be executed. (If you’re loading from a path, spaCy will only refer to the model’s `meta.json` and not actually run the package.)
```bash
python -m spacy package /your_model /tmp
cd /tmp/your_model-0.0.0
python setup.py sdist
pip install dist/your_model-0.0.0.tar.gz
```
Code example
My code assumes that your model data directory contains an `entity_matcher` directory with a `patterns.json` file. In my example, I’m using a JSON file with an object keyed by entity label, e.g. `{"GPE": [...]}`. It also adds a custom `via_patterns` attribute to the spans that lets you see whether an entity was added via the matcher when you use the model in spaCy. This is just a little gimmick for example purposes, so you can leave it out if you don’t need it.
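For concreteness, a minimal `patterns.json` in that format might look like this (the labels and token patterns here are just made-up examples):

```json
{
  "GPE": [
    [{"LOWER": "new"}, {"LOWER": "york"}],
    [{"LOWER": "san"}, {"LOWER": "francisco"}]
  ],
  "ORG": [
    [{"ORTH": "spaCy"}]
  ]
}
```

Each key is an entity label, and each value is a list of `Matcher` patterns – one list of token attribute dicts per pattern.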
I’ve just used one of the default spaCy models, added `"entity_matcher"` to the pipeline in the `meta.json`, and used the following for the model package’s `__init__.py`:
```python
# coding: utf8
from __future__ import unicode_literals

from pathlib import Path
from spacy.util import load_model_from_init_py, get_model_meta
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import Matcher
import ujson

__version__ = get_model_meta(Path(__file__).parent)['version']


def load(**overrides):
    Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg)
    return load_model_from_init_py(__file__, **overrides)


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, **cfg):
        # guard against re-registering the extension if the model is loaded twice
        if not Span.has_extension('via_patterns'):
            Span.set_extension('via_patterns', default=False)
        self.filename = 'patterns.json'
        self.patterns = {}
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            span._.via_patterns = True
            spans.append(span)
        doc.ents = list(doc.ents) + spans
        return doc

    def from_disk(self, path, **cfg):
        patterns_path = Path(path) / self.filename
        with patterns_path.open('r', encoding='utf8') as f:
            self.from_bytes(f.read())
        return self

    def to_disk(self, path):
        patterns = self.to_bytes()
        patterns_path = Path(path) / self.filename
        with patterns_path.open('w', encoding='utf8') as f:
            f.write(patterns)

    def from_bytes(self, bytes_data):
        self.patterns = ujson.loads(bytes_data)
        for label, patterns in self.patterns.items():
            self.matcher.add(label, None, *patterns)
        return self

    def to_bytes(self, **cfg):
        return ujson.dumps(self.patterns, indent=2, ensure_ascii=False)
```
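To illustrate the disk round-trip in isolation, here’s a stripped-down sketch of just the serialization half (using the stdlib `json` in place of `ujson`, with the `Matcher` wiring left out – `PatternStore` is a made-up name for this sketch):

```python
import json
import tempfile
from pathlib import Path


class PatternStore(object):
    """Minimal stand-in for the serialization half of EntityMatcher."""
    filename = 'patterns.json'

    def __init__(self):
        self.patterns = {}

    def to_disk(self, path):
        # mirrors to_bytes() + to_disk() above
        (Path(path) / self.filename).write_text(
            json.dumps(self.patterns, indent=2, ensure_ascii=False))

    def from_disk(self, path):
        # mirrors from_disk() + from_bytes() above
        self.patterns = json.loads((Path(path) / self.filename).read_text())
        return self


store = PatternStore()
store.patterns = {'GPE': [[{'LOWER': 'new'}, {'LOWER': 'york'}]]}
with tempfile.TemporaryDirectory() as tmp:
    store.to_disk(tmp)
    loaded = PatternStore().from_disk(tmp)
assert loaded.patterns == store.patterns
```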
You can also modify the code to take a path to a patterns file, instead of loading the patterns from the model data. This depends on whether you want to ship your patterns with the model, or swap them out. (Shipping your data with the model can be nice if you intend to share it with others – you can send the `.tar.gz` model to someone else on your team, and they’ll be able to just `pip install` it and use it straight away.)
As I mentioned above, the `**cfg` settings are passed down to the component from `spacy.load`, so instead of the `from_disk` and `from_bytes` methods, you can also just get the path or a list of patterns from the config parameters:

```python
patterns_path = cfg.get('patterns_path')  # get the path and then read it in
patterns = cfg.get('patterns', [])  # get a list of patterns

nlp = spacy.load('your_model', patterns_path='patterns.json')
nlp = spacy.load('your_model', patterns=[{'LOWER': 'foo'}])
```
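A sketch of what that could look like in the component’s `__init__` (plain Python, with the spaCy-specific parts omitted; `patterns` and `patterns_path` are the hypothetical config keys from above, and `CfgEntityMatcher` is just a name for this sketch):

```python
import json


class CfgEntityMatcher(object):
    """Sketch: pull patterns from **cfg instead of from_disk/from_bytes."""

    def __init__(self, **cfg):
        # a ready-made patterns object takes precedence...
        self.patterns = cfg.get('patterns', {})
        # ...otherwise fall back to reading a JSON file from a given path
        patterns_path = cfg.get('patterns_path')
        if not self.patterns and patterns_path is not None:
            with open(patterns_path, encoding='utf8') as f:
                self.patterns = json.load(f)


matcher = CfgEntityMatcher(patterns={'GPE': [[{'LOWER': 'foo'}]]})
```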
Hope this helps!