Training baseline scores vary despite random seed fixed

I have been using Prodigy to train a 'textcat' model like so:

python -m prodigy train textcat my_annotations en_vectors_web_lg --output ./my_model

and I noticed that the baseline score varies hugely between runs (0.2 to 0.55). This is even more puzzling to me given that fix_random_seed(0) is called at the beginning of training.

I tracked these variations down to the model output, so I created a minimal example to reproduce the behaviour:

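import spacy
from spacy.util import fix_random_seed

component = 'textcat'
pipe_cfg = {"exclusive_classes": False}

for i in range(5):
    # Reset the seed before each fresh model is built
    fix_random_seed(0)
    nlp = spacy.load('en_vectors_web_lg')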

    example = ("Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
               {'cats': {'Label1': 1.0, 'Label2': 0.0, 'Label3': 0.0}})

    # Set up component pipe
    nlp.add_pipe(nlp.create_pipe(component, config=pipe_cfg), last=True)
    pipe = nlp.get_pipe(component)
    for label in set(example[1]['cats']):
        pipe.add_label(label)

    # Set up training and optimiser
    optimizer = nlp.begin_training(component_cfg={component: pipe_cfg})

    # Run one document through textcat NN for scoring
    print(f"Scoring '{example[0]}'")
    print(f"Result: {pipe.model([nlp.make_doc(example[0])])}")
    print(f"First layer output: {pipe.model._layers[0]([nlp.make_doc(example[0])])}")

As far as I understand, calling fix_random_seed should produce the same output on every run, given a fixed seed and no weight updates. That does hold for the linear model but not for the CNN model, if I understand the architecture of the model correctly.
The output from the first half of the first layer stays the same on each iteration, but the second half does not (see the last line printed on each iteration).
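
Rather than eyeballing the printed arrays, the comparison could be made explicit with a check along these lines. This is only a minimal sketch: `is_deterministic`, `toy_model_output` and `leaky_model_output` are hypothetical names, the toy functions merely stand in for the body of the loop above (load the pipeline, begin training, score one document), and `random.seed` stands in for the seed-fixing call.

```python
import random


def is_deterministic(build_and_score, seed=0, runs=5):
    """Return True if build_and_score() gives identical output on every run
    when the seed is reset before each run."""
    outputs = []
    for _ in range(runs):
        random.seed(seed)  # stand-in for the seed-fixing call in the example
        outputs.append(build_and_score())
    return all(out == outputs[0] for out in outputs)


def toy_model_output():
    # Hypothetical stand-in: "initialise" three weights from the seeded RNG.
    return [random.random() for _ in range(3)]


# A separate RNG that the global seed reset does not touch.
_unseeded = random.Random()


def leaky_model_output():
    # Hypothetical stand-in for a model where one part ignores the global
    # seed: the first value is reproducible, the second is not.
    return [random.random(), _unseeded.random()]


print(is_deterministic(toy_model_output))    # True
print(is_deterministic(leaky_model_output))  # False
```

With array outputs like the ones printed above, the `==` comparison would need to be swapped for something like `numpy.array_equal`.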

Is this expected behaviour?

My setup:
Python 3.7.7

Thanks for the report, and no, this isn't expected behaviour. We've had a number of these bugs that introduce non-deterministic behaviour, which is definitely undesirable. GPU has been particularly affected by this but sometimes it's crept into the CPU models as well.

As you point out, it's an issue in spaCy rather than Prodigy. If you like, you could open an issue on the tracker and link this thread there. I'm also happy to open the issue instead if you don't have a GitHub account or would rather not take the time.

I think spaCy v2.3 has some fixes for non-determinism that might address this case. But if not, we definitely want to track it down.

In spaCy v3 we're adopting an integration with Data Version Control, which checksums trained assets. This will make non-determinism much more prominent in our workflows, which should allow us to address this problem.
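
To illustrate why checksums surface this kind of problem, here is a rough sketch of the idea, not of how Data Version Control implements it. The `checksum` helper and the byte strings are made up for illustration; in practice the bytes would come from the serialized output of a training run.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex SHA-256 digest of a serialized artifact (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# Made-up stand-ins for the serialized weights of three training runs.
run_a = b"weights: 0.12 0.34 0.56"
run_b = b"weights: 0.12 0.34 0.56"
run_c = b"weights: 0.12 0.34 0.57"

# Identical artifacts hash identically; any difference changes the digest.
print(checksum(run_a) == checksum(run_b))  # True
print(checksum(run_a) == checksum(run_c))  # False
```

If two runs with the same seed and the same data produce different digests, something in training is non-deterministic.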

Thanks for getting back to me so quickly @honnibal. I have raised this as an issue on GitHub now.