How long should I run spacy pretrain for?

I've been running spacy pretrain for around 16 hours at this point (on an AWS p3.2xlarge instance with a large Tesla V100 GPU). The loss seems to be decreasing, but I have no idea if it is good enough to stop (or when to stop).

My spacy version is 2.2.1.

Pretrain is still running; current output is below. I believe the second-to-last column is the loss: it started out as a six-digit number and is now hovering around 37k, seemingly decreasing slowly over time (but I have no way to be sure).

18 730558708 163926180 38023 54958
18 730759926 163963661 37480 54348
18 730958539 164000447 36786 53805
18 731159531 164037821 37373 54286
18 731360158 164074997 37176 53963
18 731560372 164112557 37560 54454
18 731761869 164150322 37765 53927
18 731962011 164187434 37111 54431
18 732159365 164224144 36710 53936
18 732360274 164261508 37363 54047
18 732554249 164298160 36652 53012
18 732756326 164335688 37528 54587
18 732956530 164373039 37350 54390

Hi @tjaffri,

There should be a log.jsonl that provides an epoch_loss, which makes it easier to see whether it's still learning.

If you have a dataset ready, you can start using the models that have already been written out. Check the most recent model, then the one halfway through training, the one 75% of the way through, etc., to see whether it's still improving.

If you've kept the default settings, I've usually found it takes about 12-24 hours for the accuracy of the produced models to stop improving. It depends on how much text you're pretraining with, though.

Btw, I've been trying to develop some improvements to the pretraining myself at the moment. If you don't mind your models being a bit slower on CPU, and you have a lot of text to pretrain with, you can try setting the width to 300, the depth to 8, and the embedding rows to 20k. The pretrain arguments are -er 20000 -cw 300 -cd 8, while the corresponding spacy train environment variables are embed_size=20000 token_vector_width=300 conv_depth=8 (we're working on fixing the pretty terrible hyper-parameter configuration stuff). If nothing else, the best single parameter to change is the embedding size. The v2.1 models set this a bit too low, especially for pretraining. Setting it to 20000 doesn't make the models slower, and might improve your accuracy a bit.
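To keep the two naming schemes straight, here's a small sketch of how the pretrain CLI flags above line up with the spacy train environment variables. The flag and variable names come from this thread; the helper functions themselves are just illustrative, not part of spaCy:

```python
# The hyperparameters suggested above, expressed both as `spacy pretrain`
# CLI flags and as `spacy train` environment variables (spaCy v2.x).
# The dict keys and helper functions are illustrative, not spaCy APIs.
SETTINGS = {
    "embedding_rows": 20000,  # -er / embed_size
    "width": 300,             # -cw / token_vector_width
    "depth": 8,               # -cd / conv_depth
}

def pretrain_flags(s):
    """Render the settings as `spacy pretrain` CLI flags."""
    return f"-er {s['embedding_rows']} -cw {s['width']} -cd {s['depth']}"

def train_env(s):
    """Render the same settings as `spacy train` environment variables."""
    return {
        "embed_size": str(s["embedding_rows"]),
        "token_vector_width": str(s["width"]),
        "conv_depth": str(s["depth"]),
    }

print(pretrain_flags(SETTINGS))  # -er 20000 -cw 300 -cd 8
```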

Awesome, thanks @honnibal!

  1. I'll re-run pretrain with the new hyperparameters suggested. Will let you know how it goes.

  2. I'm making heavy use of pretrain nowadays, and am reasonably familiar with the space. If there is something I can do to help, e.g. try out a preview of a new feature, please do let me know.

@honnibal what type of GPU do you use with these settings? The Tesla V100 I am using (16GB GPU memory) seems to run out of memory when I specify the full set of hyperparameters recommended:

$> spacy pretrain tasks/all.jsonl en_core_web_lg ./models/pretrain/v0.2/temp/ -er 20000 -cw 300 -cd 8
...
File "cupy/cuda/memory.pyx", line 1000, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 734, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 543,124,992 bytes (allocated so far: 15,319,202,816 bytes).

However, if I drop -cw 300 -cd 8 and only specify -er 20000, then pretrain seems to start fine. I'm monitoring the loss now.

I've got an 11GB card, so it should work. Try setting a lower maximum sequence length, -xw 256. You can also try a smaller batch size, e.g. -bs 1000, though the default of 3k should be okay.

Here are a couple of quick notebook functions I'm using to graph the losses:

import json

import pandas as pd
import plotly.express as px


def plot_losses(title, name, experiments, normalize=False):
    """Plot one loss curve per experiment. `experiments` maps a label
    (e.g. "w128") to its list of per-epoch losses."""
    dataset = []
    for exp, losses in experiments.items():
        if normalize:
            # Normalize each curve as a multiple of its own minimum loss
            min_loss = min(losses)
            losses = [loss / min_loss for loss in losses]
        dataset.extend([[exp, i, loss] for i, loss in enumerate(losses)])
    df = pd.DataFrame(dataset, columns=[name, "epoch", "loss"])
    fig = px.line(df, x="epoch", y="loss", color=name, line_group=name, hover_name="epoch",
        line_shape="spline", render_mode="svg", title=title)
    fig.show()


def get_losses(path):
    # Read the per-epoch loss out of pretrain's log.jsonl
    records = [json.loads(line) for line in open(path)]
    return [rec["epoch_loss"] for rec in records]

width128 = get_losses("cw128/log.jsonl")
width256 = get_losses("cw256/log.jsonl")
plot_losses("Losses", "Width", {"w128": width128, "w256": width256}, normalize=False)
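Since the original question was when to stop, one option is a simple plateau check over the same epoch_loss series that get_losses returns. This is just a crude heuristic sketch of mine, not anything built into spaCy:

```python
def has_plateaued(losses, window=5, rel_tol=0.01):
    """Return True if the best loss in the last `window` epochs improved by
    less than `rel_tol` (relative) over the best loss seen before that window.
    A crude stopping heuristic, not a spaCy feature."""
    if len(losses) <= window:
        return False  # not enough history to judge
    recent_best = min(losses[-window:])
    earlier_best = min(losses[:-window])
    return (earlier_best - recent_best) / earlier_best < rel_tol
```

You could feed it the output of get_losses on a run's log.jsonl and stop (or stop watching) once it returns True.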

Thanks, @honnibal! It seemed to work well, though I had to reduce the batch size. In case others are curious, here's the set of hyperparameters I ended up using (Tesla V100 GPU with 16GB memory):

spacy pretrain tasks/all.jsonl en_core_web_lg \
    ./models/pretrain/temp/ -er 20000 -cw 300 -cd 8 -bs 1000

The loss plot is shown below. I could probably have stopped sooner, but I got busy with other things and let it run longer than I needed to: