training on MACOS M2 GPU

angelo · June 5, 2026, 5:02pm

hi there

I have a labeled dataset with 23,000 samples and 3 labels, and it is taking forever to train my model (more than 1h); train-curve takes almost 4h. It is using just a single core on my macOS M2.

Is it possible to use the GPU on macOS? Please let me know.

Or at least to run training on multiple cores?

thanks a lot

angelo

magdaaniol · June 8, 2026, 8:29am

Hi @angelo!

First, some clarifications that are relevant regardless of your exact setup:

Training is single-process by design. spaCy/Prodigy don't spread the training loop across CPU cores, so seeing one core maxed out is expected — not a bug, and "use more cores" unfortunately isn't a lever here.
The GPU is opt-in, and only helps transformer pipelines. spaCy/Prodigy won't use the Apple Silicon GPU automatically: you have to request it with --gpu-id 0 (the default is -1, meaning CPU) when using the prodigy train recipe. On your M2 there's only one GPU, so the ID is always 0. But the GPU (via Metal/MPS) only accelerates transformer models; for a CNN/vectors pipeline it makes little to no difference.
For 23k examples on 3 labels, this should be fast. A CNN text-classification model (using vectors from en_core_web_lg) typically trains in minutes on the M2 CPU — no GPU needed. An hour-plus usually means a transformer model is running on CPU. The ~4h train-curve is expected on top of that, since train-curve retrains several times on increasing slices of your data (so it's roughly 4× a single run).

To give you the exact fix, could you share two things?

The exact prodigy train command you ran (did it include --gpu-id?)
The base model / config — is it transformer-based (e.g. en_core_web_trf, or a config with a transformer component) or a CNN/vectors pipeline (e.g. en_core_web_lg)?

For a 3-label classification task, the CNN route is usually the faster and simpler win — but I'll confirm once I see what you're running.

If you want to check GPU availability in the meantime, you can run this in Python:

import spacy
spacy.require_gpu(0)
from thinc.api import get_current_ops
print(get_current_ops())   # <...MPSOps...> means the Apple GPU is usable

angelo · June 9, 2026, 1:39pm

hi there
thanks for your answer, here are the commands I am running:

date; time python -m prodigy train NER_base_lg_llm_trained --ner NER_base_lg_llm11_review --lang es --label-stats -m es_core_news_lg ; date

date; time python -m prodigy train-curve --ner NER_base_lg_llm11_review --show-plot ; date

angelo · June 15, 2026, 4:23pm

hi there:

do you have any news for me? please

magdaaniol · June 16, 2026, 7:28am

Hi @angelo,

Apologies for the delay, and thanks for the additional info; that helps a lot. The key detail is that es_core_news_lg is a CNN/vectors pipeline, not a transformer, so the GPU won't make it faster. As I mentioned before, MPS (the Apple GPU) only accelerates transformer models.

That said, there's a more important detail in your exact command worth stepping back on.

You're not training a fresh 3-label model: you're fine-tuning the existing Spanish NER. Because you pass -m es_core_news_lg, Prodigy loads that model's ner and tok2vec components with their pretrained weights and continues training them. It also inherits that model's training schedule instead of generating one for your dataset, and that schedule is oversized for your case. Your run uses the values baked into es_core_news_lg (patience = 5000, max_steps = 100000, eval_frequency = 1000), a schedule built to train the full Spanish pipeline on a large corpus, not to fine-tune NER on 23k examples.

The model likely converges quickly (fine-tuning from pretrained weights usually does) and then keeps training long after, re-scoring a held-out 20% of your data every 1,000 steps until the score fails to improve for 5,000 steps. That, plus document length (time scales with total tokens, not example count), is where your hour-plus most likely goes — not the GPU or cores.
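To make the effect of patience concrete, here is a minimal sketch of a patience-based stopping rule. It is a simplified illustration, not spaCy's actual training loop: the function name steps_until_stop and the score values are made up for the example, and only the three schedule values come from the config discussed above.

```python
def steps_until_stop(scores, eval_frequency=1000, patience=5000, max_steps=100000):
    """Return (stop_step, best_step) for a simplified patience-based loop.

    scores[i] is the evaluation score measured at step (i + 1) * eval_frequency.
    Training stops once the best score is `patience` steps old, or at max_steps.
    """
    best_score = float("-inf")
    best_step = 0
    step = 0
    for i, score in enumerate(scores):
        step = (i + 1) * eval_frequency
        if score > best_score:
            best_score, best_step = score, step
        if step - best_step >= patience or step >= max_steps:
            return step, best_step
    # Ran out of recorded scores before any stopping condition fired.
    return step, best_step


# Hypothetical scores: the model peaks at the third evaluation, then plateaus.
scores = [0.80, 0.85, 0.86, 0.855, 0.858, 0.859, 0.857, 0.856, 0.855]
stop_step, best_step = steps_until_stop(scores)
print(f"best score at step {best_step}, training stopped at step {stop_step}")
```

With these made-up numbers the best checkpoint is reached well before training stops, and every step in between is spent waiting for the patience window to run out. A smaller patience and eval_frequency shrink that window.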

So before we change anything, the key question is: do you want to fine-tune the Spanish NER, or train a clean model for just your 3 labels?

angelo · June 17, 2026, 12:35pm

hi there,

thanks for your answer. I am trying to extend a model to include embeddings and NER for some particular domains. My task is to do basic NER (PER, LOC, ORG) and later include DATE, LAW, MONEY, PERCENTAGE.
That is to filter documents with those entities; later I am planning to add coreference and another set of features.

That being said, I am not sure about starting from a clean model, as the embeddings should be useful in later tasks.

Do you have any advice for me, please?

thanks

angelo

magdaaniol · June 18, 2026, 8:19am

Hi @angelo,

I see what your intention is, so let me first explain what's happening under the hood, because the way it's set up right now will most likely not work as expected.

When you pass -m es_core_news_lg, you're not "adding" to the model — you're continuing to train its existing NER weights on your data. The thing to know about NER training is that every example is treated as complete: if a span isn't labeled, the model learns that it should not be an entity there. So if your 23k examples only contain your 3 new labels (and no PER/LOC/ORG), then every example is implicitly teaching the model "there are no people, locations or organizations here" — and step by step it stops predicting them. This is called catastrophic forgetting: train only on the new labels and the original ones quietly degrade.

What I'd recommend instead is to keep them as two separate NER components. Rather than overwriting the pretrained NER component, train a fresh NER component for your custom labels and run it alongside the original one in the same pipeline. spaCy has an official project showing exactly this. It walks through the ways to combine two trained NER components and the tradeoffs of each.

One important assumption for this to work cleanly: your custom labels and the pretrained PER/LOC/ORG shouldn't compete for the same spans. Because the two components reason independently, this approach is a great fit when your domain entities occupy different text than people/locations/orgs — but if you find them frequently fighting over the same tokens (e.g. an org name that's also one of your custom types), the cleaner option is a single combined model: pre-annotate the originals with the stock model, merge with your gold labels (so that the training dataset contains all the labels), and train one component that resolves the conflicts during training. For adding distinct domain labels on top of stock NER, the two-component route is the right call.
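If you do end up going the single combined model route, the merge step could look roughly like the sketch below. It works on plain (start, end, label) character offsets rather than on Prodigy or spaCy objects, and the function name merge_annotations and the example spans are hypothetical. The one design choice it encodes is that your gold annotations win whenever a stock-model prediction overlaps them.

```python
def merge_annotations(gold_spans, predicted_spans):
    """Combine gold spans with model predictions; gold wins on any overlap.

    Spans are (start, end, label) tuples with an exclusive end offset.
    """
    merged = list(gold_spans)
    for p_start, p_end, p_label in predicted_spans:
        overlaps_gold = any(
            p_start < g_end and g_start < p_end
            for g_start, g_end, _ in gold_spans
        )
        if not overlaps_gold:
            merged.append((p_start, p_end, p_label))
    return sorted(merged)


# Hypothetical offsets: one gold custom span, two stock-model predictions.
gold = [(10, 14, "LAW")]
predicted = [(0, 2, "PER"), (12, 16, "ORG")]
print(merge_annotations(gold, predicted))
```

Here the ORG prediction is dropped because it overlaps the gold LAW span, while the PER prediction is kept. The merged spans would still need a review pass, since the stock model's predictions are not gold.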

The one technical detail to know: doc.ents can only hold one entity per token, so two ner components writing to it will overwrite each other. The clean fix (covered in the project) is to give them distinct names and have your custom one write to its own span group:

nlp.add_pipe("ner", name="ner_default")          # pretrained PER/LOC/ORG, untouched
nlp.add_pipe("ner", name="ner_custom", ...)      # your 3 labels, written to doc.spans["custom_ents"]

This way the pretrained PER/LOC/ORG stays fully intact (no forgetting, nothing to re-annotate), your custom labels live in their own component you can retrain freely, and you read both sets of results side by side. It also scales cleanly when you add LAW later — it just joins the custom component.

About pre-trained embeddings: you can definitely use them for your custom NER component. The Spanish word vectors in es_core_news_lg are a static lookup table. They are not trained when you train an NER component; training only updates that component's own weights, never the vectors. So you should point your new custom component at es_core_news_lg's vectors and use them as its features (set vectors = "es_core_news_lg" in the config, or initialize from that base). Your custom NER then gets the full benefit of the pretrained Spanish embeddings, which is exactly what helps it generalize to entities it didn't see in training, and those same vectors remain untouched and available for your later downstream features. Nothing about training your custom labels degrades or alters them.

For DATE, MONEY, PERCENTAGE, given these are regular, well-formatted entities, a statistical model is overkill. Consider using an entity_ruler with token patterns/regex for them. It's more reliable, fully debuggable, and saves you annotation effort. Put it in the pipeline alongside the NER components.
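To show how regular these entities are, here is a small sketch using plain Python regular expressions. It is not the entity_ruler itself, and the two patterns are hypothetical and deliberately narrow; real documents will need more variants (currency names, date formats and so on). The same kind of expression could later be carried into ruler patterns.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "PERCENTAGE": re.compile(r"\b\d+(?:[.,]\d+)?\s?%"),
    "MONEY": re.compile(
        r"(?:\$|€)\s?\d+(?:[.,]\d+)*"
        r"|\b\d+(?:[.,]\d+)*\s?(?:euros|pesos|dólares)\b"
    ),
}


def find_entities(text):
    """Return sorted (start, end, label) character spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "El interés sube un 3,5 % y la multa es de 1.200 euros."
for start, end, label in find_entities(text):
    print(label, repr(text[start:end]))
```

Because every match comes from an explicit pattern, a missed or wrong entity can be traced back to a single expression and fixed there, with no retraining involved.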

Also, training your custom component with a config sized for 23k examples should converge in minutes on the M2 CPU.

I really recommend reading the spaCy documentation on training, especially the part on how config files work.
