Training multiple textcat models at once with joblib

I'm using joblib to train multiple textcat models at once. I'm getting this warning -- should I be worried?

Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (684831, 300))

Here is a rough draft of the code -- is it sensible to use the same nlp object for all jobs?

import random

import spacy
from joblib import Parallel, delayed
from functools import partial
from spacy.util import minibatch, compounding


def main():
    # labels, n_iter, dropout, learn_rate, batch_start, batch_max,
    # output_dirs and train_data are defined elsewhere in the script
    nlp = spacy.load('en_core_web_lg')
    trainer_ = delayed(partial(
        trainer,
        nlp=nlp,
        labels=labels,
        n_iter=n_iter,
        dropout=dropout,
        learn_rate=learn_rate,
        batch_start=batch_start,
        batch_max=batch_max
    ))
    executor = Parallel(n_jobs=4, backend="multiprocessing", prefer="processes")
    tasks = (
        trainer_(tdata, output_dir=output_dir)
        for output_dir, tdata in zip(output_dirs, train_data)
    )
    executor(tasks)


def trainer(
    train_data,
    output_dir,
    nlp,
    labels,
    n_iter,
    dropout,
    learn_rate,
    batch_start,
    batch_max,
):
    config = {"exclusive_classes": False, "architecture": 'bow'}
    textcat = nlp.create_pipe("textcat", config=config)
    nlp.add_pipe(textcat, last=True)

    for label in labels:
        textcat.add_label(label)

    batch_sizes = compounding(batch_start, batch_max, 1.001)
    # disable every pipe except the textcat component being trained
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        optimizer.learn_rate = learn_rate
        for epoch in range(1, n_iter + 1):
            losses = {}
            random.shuffle(train_data)
            batches = minibatch(train_data, size=batch_sizes)
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

    # save each job's trained model to its own directory
    nlp.to_disk(output_dir)

Also, I'm only seeing 2 cores active even though n_jobs=4, which is strange.

Yeah, you should avoid sharing the same nlp object -- just load the nlp object within each subprocess instead. Passing the nlp object like that is also less efficient, because joblib has to pickle the object in the parent and unpickle it in each worker, which is slower than just loading the model there.
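
Roughly like this -- a minimal sketch, assuming the same hyperparameters and the train_data/output_dirs pairing from your draft, with the spacy.load call moved into the worker:

import spacy
from joblib import Parallel, delayed

def trainer(train_data, output_dir, labels, n_iter, dropout,
            learn_rate, batch_start, batch_max):
    # Load a fresh pipeline inside the worker process, so nothing
    # has to be pickled across the process boundary.
    nlp = spacy.load('en_core_web_lg')
    ...  # same textcat setup and training loop as above
    nlp.to_disk(output_dir)

tasks = (
    delayed(trainer)(tdata, output_dir, labels, n_iter, dropout,
                     learn_rate, batch_start, batch_max)
    for output_dir, tdata in zip(output_dirs, train_data)
)
Parallel(n_jobs=4, backend="multiprocessing", prefer="processes")(tasks)

Each worker then has its own copy of the vocab and vectors, so you also shouldn't hit the situation the unnamed-vectors warning is about (multiple vectors models loaded into one process).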