Hi @angelo,
I see what your intention is, so let me first explain what's happening under the hood, because the way it's set up right now will most likely not work as expected.
When you pass -m es_core_news_lg, you're not "adding" to the model — you're continuing to train its existing NER weights on your data. The thing to know about NER training is that every example is treated as complete: if a span isn't labeled, the model learns that it should not be an entity there. So if your 23k examples only contain your 3 new labels (and no PER/LOC/ORG), then every example is implicitly teaching the model "there are no people, locations or organizations here" — and step by step it stops predicting them. This is called catastrophic forgetting: train only on the new labels and the original ones quietly degrade.
What I'd recommend instead is to keep them as two separate NER components. Rather than overwriting the pretrained NER component, train a fresh NER component for your custom labels and run it alongside the original one in the same pipeline. spaCy has an official project showing exactly this: it walks through the ways to combine two trained NER components and the tradeoffs of each.
One important assumption for this to work cleanly: your custom labels and the pretrained PER/LOC/ORG shouldn't compete for the same spans. Because the two components reason independently, this approach is a great fit when your domain entities occupy different text than people/locations/orgs — but if you find them frequently fighting over the same tokens (e.g. an org name that's also one of your custom types), the cleaner option is a single combined model: pre-annotate the originals with the stock model, merge with your gold labels (so that the training dataset contains all the labels), and train one component that resolves the conflicts during training. For adding distinct domain labels on top of stock NER, the two-component route is the right call.
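If you do end up going the combined-model route, the merge step could look roughly like this. It's a minimal sketch in plain Python, assuming annotations are stored as (start_char, end_char, label) tuples. The helper names, the CONTRACT label and the example sentence are made up for illustration, and the "silver" list stands in for whatever the stock model predicts on your texts. Your gold spans win whenever the two overlap:

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start_char, end_char, label)


def span_of(text: str, phrase: str, label: str) -> Annotation:
    """Build an annotation from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two character ranges share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_annotations(gold: List[Annotation], silver: List[Annotation]) -> List[Annotation]:
    """Keep every gold span; add a silver span only if it overlaps no gold span."""
    merged = list(gold)
    for span in silver:
        if not any(overlaps(span, g) for g in gold):
            merged.append(span)
    return sorted(merged)


text = "Ana Torres firma el contrato C-123 en Madrid."

# Your manual annotation (CONTRACT is a hypothetical custom label).
gold = [span_of(text, "contrato C-123", "CONTRACT")]

# Stand-in for stock-model predictions, including one that collides with gold.
silver = [
    span_of(text, "Ana Torres", "PER"),
    span_of(text, "C-123", "ORG"),
    span_of(text, "Madrid", "LOC"),
]

merged = merge_annotations(gold, silver)
for start, end, label in merged:
    print(label, repr(text[start:end]))
```

The colliding ORG prediction is dropped because it overlaps a gold span, so the merged data never contains two labels for the same characters.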
The one technical detail to know: doc.ents can't hold overlapping entities (each token belongs to at most one entity), so two ner components that both write to it will get in each other's way. The clean fix (covered in the project) is to give them distinct names and have your custom one write to its own span group:
nlp.add_pipe("ner", name="ner_default") # pretrained PER/LOC/ORG, untouched
nlp.add_pipe("ner", name="ner_custom", ...) # your 3 labels, written to doc.spans["custom_ents"]
This way the pretrained PER/LOC/ORG stays fully intact (no forgetting, nothing to re-annotate), your custom labels live in their own component you can retrain freely, and you read both sets of results side by side. It also scales cleanly when you add LAW later — it just joins the custom component.
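One caveat on the span group part: the built-in ner component itself writes to doc.ents, so getting the custom predictions into doc.spans needs a small extra step. One possible way (a sketch, not necessarily what the project does) is a tiny custom component placed right after ner_custom that moves its predictions into doc.spans["custom_ents"] and clears doc.ents again. The component name ents_to_span_group is made up, and the LAW entity below is set by hand as a stand-in for a trained model:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span


@Language.component("ents_to_span_group")  # hypothetical component name
def ents_to_span_group(doc: Doc) -> Doc:
    """Move whatever is currently in doc.ents into doc.spans["custom_ents"]."""
    doc.spans["custom_ents"] = list(doc.ents)
    # Reset to "missing" rather than "outside", so the tokens carry no
    # entity annotation at all when a later component looks at them.
    doc.set_ents([], default="missing")
    return doc


# Stand-in for the custom NER: set one entity by hand on a blank pipeline.
nlp = spacy.blank("es")
doc = nlp.make_doc("La Ley 19.880 fue citada en Madrid")
doc.ents = [Span(doc, 1, 3, label="LAW")]

doc = ents_to_span_group(doc)

custom = [(span.start, span.end, span.label_) for span in doc.spans["custom_ents"]]
remaining = list(doc.ents)
print(custom, remaining)
```

In a real pipeline the order would be ner_custom, then this component, then ner_default, so that doc.ents ends up holding only the stock PER/LOC/ORG predictions.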
About pretrained embeddings: you can definitely use them for your custom NER component. The Spanish word vectors in es_core_news_lg are a static lookup table. They are not trained when you train an NER component; training only updates that component's own weights, never the vectors. So you should point your new custom component at es_core_news_lg's vectors and use them as its features (set vectors = "es_core_news_lg" in the config, or initialize from that base). Your custom NER then gets the full benefit of the pretrained Spanish embeddings, which is exactly what helps it generalize to entities it didn't see in training, and those same vectors remain untouched and available for your later downstream features. Nothing about training your custom labels degrades or alters them.
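In config terms, that might look roughly like the fragment below. This is a partial sketch, not a complete config: it assumes your custom pipeline uses a tok2vec component with a spacy.MultiHashEmbed.v2 embedding layer (your generated config may use a different layout), and the other settings of that block are left out.

```ini
[paths]
vectors = "es_core_news_lg"

[initialize]
vectors = ${paths.vectors}

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
# ... other embed settings unchanged ...
include_static_vectors = true
```

The include_static_vectors switch is what makes the embedding layer actually use the vectors as features; loading them under [initialize] alone is not enough.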
For DATE, MONEY, PERCENTAGE, given these are regular, well-formatted entities, a statistical model is overkill. Consider using an entity_ruler with token patterns/regex for them. It's more reliable, fully debuggable, and saves you annotation effort. Put it in the pipeline alongside the NER components.
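As a rough sketch of what such rules could look like (the patterns and the example sentence are illustrative assumptions, and real data will need more variants), here is an entity_ruler on a blank Spanish pipeline, so it runs without any trained model:

```python
import spacy

nlp = spacy.blank("es")
ruler = nlp.add_pipe("entity_ruler")

MONTHS = [
    "enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
    "agosto", "septiembre", "octubre", "noviembre", "diciembre",
]

ruler.add_patterns([
    # e.g. "12 de marzo de 2024"
    {"label": "DATE", "pattern": [
        {"IS_DIGIT": True},
        {"LOWER": "de"},
        {"LOWER": {"IN": MONTHS}},
        {"LOWER": "de"},
        {"SHAPE": "dddd"},
    ]},
    # e.g. "15 %"
    {"label": "PERCENTAGE", "pattern": [{"LIKE_NUM": True}, {"ORTH": "%"}]},
    # e.g. "1.500 euros"
    {"label": "MONEY", "pattern": [
        {"LIKE_NUM": True},
        {"LOWER": {"IN": ["euros", "eur", "€"]}},
    ]},
])

doc = nlp("El 12 de marzo de 2024 el precio subió un 15 % hasta 1.500 euros por unidad")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

In the full pipeline, one option is to add the ruler before the NER components (nlp.add_pipe("entity_ruler", before=...)), so the rule-based spans are already set when the statistical components run.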
Also, training your custom component with a config sized for 23k examples should converge in minutes on the M2 CPU.
I really recommend reading the spaCy documentation on training, especially the section on how config files work.