Does the outputted model contain the custom pipeline components?

Shane · November 26, 2018, 6:39am

Hi,

Based on the sample PhraseMatcher code in the (link), I have extended the GPE entity category to include some of the Australian Suburbs & States.

After running nlp.add_pipe(entity_matcher, after='ner') and adding these location names into the pipeline, does the outputted model (nlp.to_disk()) capture these new locations?

An additional question: From the best practice post, I found that

PhraseMatcher before NER would transform matched words/phrases into an entity and NER will not touch it afterwards.

So what would be the difference if I add the entity_matcher after ner? E.g. nlp.add_pipe(entity_matcher, after='ner')

Thanks,
Shane

ines · November 26, 2018, 11:31am

The model you're saving out here will preserve the original pipeline – e.g. ['ner', 'entity_matcher'] – but it won't actually include your custom code. This is important, because the to_disk/from_disk methods should never quietly save and eval (!) arbitrary code. They only save out data.

So when you're loading your model back in, you'll need to make sure to add an entry to the Language.factories (see here) that tells spaCy how to initialize the pipeline component entity_matcher.
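To illustrate why the factory entry is needed, here is a toy model of the save/load round trip. It is only a sketch and not spaCy's implementation: `FACTORIES`, `save_pipeline` and `load_pipeline` are hypothetical names standing in for `Language.factories`, `nlp.to_disk()` and `spacy.load()`. The assumption is that only the component names are written to disk, so loading has to look each name up in a registry.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry standing in for spaCy's Language.factories:
# maps a component name to a callable that builds the component.
FACTORIES = {}


def save_pipeline(names, path):
    """Write only data (the ordered component names), never any code."""
    Path(path).mkdir(parents=True, exist_ok=True)
    (Path(path) / "meta.json").write_text(json.dumps({"pipeline": names}))


def load_pipeline(path):
    """Rebuild the components by looking each saved name up in FACTORIES."""
    meta = json.loads((Path(path) / "meta.json").read_text())
    missing = [name for name in meta["pipeline"] if name not in FACTORIES]
    if missing:
        raise KeyError("no factory registered for: " + ", ".join(missing))
    return [FACTORIES[name]() for name in meta["pipeline"]]


with tempfile.TemporaryDirectory() as tmp:
    save_pipeline(["ner", "entity_matcher"], tmp)
    FACTORIES["ner"] = lambda: "built-in ner"
    # Without this entry, load_pipeline() raises KeyError for 'entity_matcher'.
    FACTORIES["entity_matcher"] = lambda: "custom entity matcher"
    print(load_pipeline(tmp))
```

In real spaCy the registered callable would return your component object rather than a string; the point is only that the name saved on disk must resolve to code you provide at load time.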

Alternatively, you can also include your custom component with your model, by turning it into a Python package (that can ship code). My comment on this thread explains this in more detail:

github.com/explosion/spaCy

How to package a completely external NER model with spacy for use in prodigy

opened 03:16PM - 15 Aug 18 UTC

closed 12:45PM - 12 Sep 18 UTC

hannahlindsley

docs feat / ner feat / serialize ✨ prodigy

I'm unclear on how I can take an existing, external-to-spacy NER model (a crf) and package it for use in prodigy. I've been able to do it for a custom tokenizer by extending the spacy Tokenizer class, but I think I'm barking up all kinds of wrong trees when I try to do it for NER. Am I missing this in the doc? All I've been able to ascertain so far is how to _retrain_ an existing spacy model, which is not what I'm looking to do. Thanks!

## Which page or section is this issue related to?

https://spacy.io/usage/training#section-ner
https://spacy.io/usage/training#saving-loading
https://spacy.io/api/language#to_disk

In that case, it'd still add the entities to the doc.ents – however, you'd have to take care of reconciling duplicates and overlapping matches. By definition, one token can only be part of one entity, and you'd have to decide which entities should take precedence (the ones set by the statistical NER or the ones set by your custom component). This depends on your use case, the specific entities etc.
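As a rough sketch of what that reconciliation could look like, the function below merges two lists of entity spans and lets one source take precedence. It works on plain `(start, end, label)` token-offset tuples with an exclusive `end`, not on spaCy `Span` objects, and the name `reconcile` and the `prefer_custom` flag are made up for illustration.

```python
def reconcile(ner_ents, custom_ents, prefer_custom=True):
    """Merge two lists of (start, end, label) token spans so that no token
    ends up in more than one entity. `end` is exclusive."""
    if prefer_custom:
        first, second = custom_ents, ner_ents
    else:
        first, second = ner_ents, custom_ents
    kept = []
    taken = set()  # token indices already claimed by a kept entity
    for start, end, label in list(first) + list(second):
        tokens = set(range(start, end))
        if tokens & taken:
            continue  # overlaps (or duplicates) a span that already won
        kept.append((start, end, label))
        taken |= tokens
    return sorted(kept)


ner_ents = [(0, 2, "PERSON"), (5, 6, "GPE")]
custom_ents = [(1, 3, "GPE"), (5, 6, "GPE")]
print(reconcile(ner_ents, custom_ents, prefer_custom=True))
print(reconcile(ner_ents, custom_ents, prefer_custom=False))
```

Which source should win is the use-case decision described above; the flag just makes that choice explicit.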

Since rule-based NER is something people are really interested in, I've written a simple built-in component for spaCy v2.1.x (currently available for testing as spacy-nightly). You can check out the code and see what it does in addition to just adding the entities (in order to make it interoperate with the statistical entity recognizer):

github.com/explosion/spaCy

💫 Rule-based NER component

explosion:develop ← explosion:feature/rule-based-ner-component

opened 10:34AM - 05 Jul 18 UTC

ines

+273 -5

## Description

This is probably one of the most common pipeline components built by users, so we want to ship a reusable component and factory with the core library. The component takes match patterns, each with an assigned label, finds them in `Doc` and adds them to the `doc.ents`. This makes it easier to combine rule-based and statistical approaches to NER (e.g. use both the model's predictions and a lexicon.) Pattern files can also be shipped with a model package.

### Features

* Accepts both token patterns (one dict per token describing its attributes) or phrase patterns (exact string matches).
* Can be added before or after an existing entity recognizer in the pipeline. If added before, spaCy's NER model will respect the existing entities and the new constraints, which can potentially lead to better accuracy. If added after, you can optionally set `overwrite_ents=True` to overwrite existing entities.
* If matches overlap (e.g. via token patterns with operators), the component will try to find the best possible combination of entities based on the matches.
* When you save out the `nlp` object, the patterns will be saved as a `.jsonl` file in the model directory. This lets you ship the patterns **with your model package** 🎉
* Bonus: The patterns have the same format as [Prodigy](https://prodi.gy)'s pattern files. So if you're using Prodigy, you can load in your existing pattern files.

### Usage example

Consider a patterns file like this:

```json
{"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}
```

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp).from_disk('patterns.jsonl')
nlp.add_pipe(ruler, before='ner')
```

### Todo

- [ ] Figure out the naming – see [this Twitter thread](https://twitter.com/_inesmontani/status/1010961474532073472)

### Types of change

enhancement

## Checklist

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

Shane · November 28, 2018, 12:50am

Thank you for the detailed explanation, Ines! This gives me more than enough to get started.

Just out of curiosity, what would be the difference between the current entity_matcher and EntityRuler in the spacy-nightly?

ines · November 28, 2018, 1:14pm

You can check out the EntityRuler code here – the way it works is very similar, but its API is closer to the other matcher APIs, and it has a few more methods (checking if a pattern exists, exporting/importing patterns to and from JSONL etc.). It also lets you set an argument to overwrite existing entities (if the component is added after the regular entity recognizer, for instance) and handles overlapping matches by only selecting the largest span, since a token can only be part of one entity and entity spans therefore can't overlap.
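The "largest span wins" idea can be sketched in a few lines. This is an illustration of the general technique only and is not taken from the EntityRuler source: the helper name `select_longest` is hypothetical, and matches are assumed to be plain `(start, end, label)` token-offset tuples with an exclusive `end`.

```python
def select_longest(matches):
    """Greedy filter: visit the longest matches first (ties go to the
    earlier start) and keep a match only if none of its tokens is taken."""
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept = []
    taken = set()  # token indices already covered by a kept match
    for start, end, label in ordered:
        tokens = set(range(start, end))
        if tokens & taken:
            continue  # overlaps a longer (or earlier) match
        kept.append((start, end, label))
        taken |= tokens
    return sorted(kept)


matches = [(0, 1, "GPE"), (0, 2, "GPE"), (1, 3, "ORG"), (4, 5, "GPE")]
print(select_longest(matches))
```

A greedy pass like this is simple but not guaranteed to maximise total coverage; it just ensures that the surviving spans never share a token.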
