I've been running spacy pretrain for around 16 hours at this point (on an AWS p3.2xlarge instance with a large Tesla V100 GPU). The loss seems to be decreasing, but I have no idea if it is good enough to stop (or when to stop).
My spacy version is 2.2.1.
Pretrain is still running (current output below). I believe the second-to-last column is the loss: it started out as a six-digit number and is now hovering around 37k, slowly decreasing over time (but I have no way to be sure it's enough).
There should be a log.jsonl that provides an epoch_loss, which makes it easier to see whether it's still learning.
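For anyone following along, here's a minimal sketch of pulling `epoch_loss` out of that `log.jsonl` (the field name is from the pretrain output discussed here; the sample lines below are made up for illustration, a real log has one JSON object per epoch):

```python
import json

# Hypothetical sample of two log.jsonl lines (a real file is read with open(path)):
sample = [
    '{"epoch": 0, "epoch_loss": 612345.0}',
    '{"epoch": 1, "epoch_loss": 37012.5}',
]

# Parse each line and collect the epoch_loss values to eyeball the trend.
losses = [json.loads(line)["epoch_loss"] for line in sample]
print(losses)  # [612345.0, 37012.5]
```

If the most recent values are still dropping noticeably epoch over epoch, it's probably still learning.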
If you have a dataset ready, you can already start using the models that have been written out. Check the most recent model, then the one halfway through training, the one 75% of the way through, etc., to see whether it's still improving.
If you've kept the default settings, I've usually found it takes about 12-24 hours for the accuracy of the models produced to stop improving. It depends on how much text you're pretraining with, though.
Btw, I've been trying to develop some improvements to the pretraining myself at the moment. If you don't mind that your models will be a bit slower on CPU, and you have a lot of text to pretrain with, you can try setting the width to 300, the depth to 8, and the embedding rows to 20k. The pretrain arguments are -er 20000 -cw 300 -cd 8, while the spacy train environment variables would be embed_size=20000 token_vector_width=300 conv_depth=8 (we're working on fixing the pretty terrible hyper-parameter configuration stuff). If nothing else, the best parameter to change is the embedding size. The v2.1 models set this a bit too low, especially for pretraining. Setting it to 20000 doesn't make the models slower, and might improve your accuracy a bit.
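Concretely, the two commands might look like this (the corpus, model, and output paths are placeholders; the flags and environment variables are the ones mentioned above):

```shell
# Wider/deeper pretraining with a bigger embedding table:
spacy pretrain texts.jsonl en_vectors_web_lg ./pretrain-out -er 20000 -cw 300 -cd 8

# The downstream train run needs matching hyper-parameters, passed as env vars:
embed_size=20000 token_vector_width=300 conv_depth=8 \
    spacy train en ./train-out train.json dev.json -t2v ./pretrain-out/model-best.bin
```

The env vars have to match what was used for pretraining, otherwise the pretrained tok2vec weights won't fit the network being trained.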
I'll re-run pretrain with the new hyperparameters suggested. Will let you know how it goes.
I'm making heavy use of pretrain nowadays, and am reasonably familiar with the space. If there is something I can do to help, e.g. try out a preview of a new feature, please do let me know.
@honnibal what type of GPU do you use with these settings? The Tesla V100 I am using (16GB GPU memory) seems to run out of memory when I specify the full set of hyperparameters recommended:
$> spacy pretrain tasks/all.jsonl en_core_web_lg ./models/pretrain/v0.2/temp/ -er 20000 -cw 300 -cd 8
...
File "cupy/cuda/memory.pyx", line 1000, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 734, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 543,124,992 bytes (allocated so far: 15,319,202,816 bytes).
However, if I drop the "-cw 300 -cd 8" and only specify the "-er 20000" then pretrain seems to have started... I'm monitoring the loss now.
I've got an 11gb card so it should work. Try setting a lower max sequence length, -xw 256. You can also try a small batch size, e.g. -bs 1000. The default of 3k should be okay though.
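Putting those memory-saving flags together with the earlier invocation, the full command might look like this (same corpus and output paths as the command above):

```shell
# Cap the max sequence length and shrink the batch size to fit in GPU memory:
spacy pretrain tasks/all.jsonl en_core_web_lg ./models/pretrain/v0.2/temp/ \
    -er 20000 -cw 300 -cd 8 -xw 256 -bs 1000
```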
Here are a couple of quick notebook functions I'm using to graph the losses:
    import json

    import pandas as pd
    import plotly.express as px

    def plot_losses(title, name, experiments, normalize=False):
        dataset = []
        for exp, losses in experiments.items():
            if normalize:
                # Normalize relative to the minimum loss value
                min_loss = min(losses)
                losses = [loss / min_loss for loss in losses]
            dataset.extend([[exp, i, loss] for i, loss in enumerate(losses)])
        # Column for the experiment label must match `name`, since px.line
        # looks it up via color=name below.
        df = pd.DataFrame(dataset, columns=[name, "epoch", "loss"])
        fig = px.line(df, x="epoch", y="loss", color=name, line_group=name,
                      hover_name="epoch", line_shape="spline",
                      render_mode="svg", title=title)
        fig.show()

    def get_losses(path):
        records = [json.loads(line) for line in open(path)]
        losses = [rec["epoch_loss"] for rec in records]
        return losses

    width128 = get_losses("cw128/log.jsonl")
    width256 = get_losses("cw256/log.jsonl")
    plot_losses("Losses", "Width", {"w128": width128, "w256": width256}, normalize=False)
Thanks, @honnibal! It seemed to work well, though I had to reduce the batch size. In case others are curious, here's the set of hyperparameters I ended up using (Tesla V100 GPU with 16GB memory):
The loss plot is shown below. I could probably have stopped it sooner, but I got busy with other stuff and let it run longer than necessary: