Error when loading models trained with more than 999 samples

I found a peculiar error: once my annotated dataset reaches 1000 samples, loading the trained model fails with a dimension error.

Let’s say I have an annotated-data.jsonl with more than 1000 samples. With 999 samples the model loads fine, but with 1000 samples it does not.

prodigy drop test
prodigy dataset test "yoyo"
head -n 1000 annotated-data.jsonl > small-data.jsonl
prodigy db-in test small-data.jsonl
prodigy textcat.batch-train test en_core_web_sm --output test -n 1
python -c "import spacy; spacy.load('test')"

Here is the error.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 13, in load
    return util.load_model(name, **overrides)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 107, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 138, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 541, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 483, in from_disk
    reader(path / key)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 537, in <lambda>
    deserializers[] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "spacy/pipeline.pyx", line 170, in spacy.pipeline.BaseThincComponent.from_disk (spacy/pipeline.cpp:11298)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/", line 483, in from_disk
    reader(path / key)
  File "spacy/pipeline.pyx", line 163, in spacy.pipeline.BaseThincComponent.from_disk.load_model (spacy/pipeline.cpp:10856)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/thinc/neural/_classes/", line 352, in from_bytes
    copy_array(dest, param[b'value'])
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/thinc/neural/", line 48, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (64)

Thanks for the report.

As weird as this issue sounds, it does make sense: there’s a heuristic in spaCy’s textcat class that uses a smaller neural net architecture if very few examples are available. The cutoff is 1000 examples. It appears the larger model, the one used from 1000 examples up, is broken. I thought I had a test for this, but apparently not!

The issue only occurs when there are no pre-trained vectors loaded (as in the en_core_web_sm model). I’d been testing with a vectors model such as en_vectors_web_lg, which is why the problem didn’t show up before.
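To make the failure mode concrete, here is a rough sketch in plain Python of how a size-dependent heuristic can produce this kind of shape mismatch. This is not spaCy's actual code: the function names are hypothetical, the widths 64 and 128 are simply taken from the error message above, and the idea that the model skeleton is rebuilt at load time without knowing the example count is an assumption made for illustration.

```python
# Illustrative only: hypothetical names, not spaCy or thinc internals.
SMALL_WIDTH = 64    # assumed width of the small architecture (from the error message)
LARGE_WIDTH = 128   # assumed width of the large architecture (from the error message)
CUTOFF = 1000       # cutoff mentioned in this thread


def choose_width(n_examples):
    """Pick a layer width from the number of training examples."""
    return SMALL_WIDTH if n_examples < CUTOFF else LARGE_WIDTH


def train(n_examples):
    """Pretend to train: return 'saved' weights sized by the heuristic."""
    return {"value": [0.0] * choose_width(n_examples)}


def load(saved, n_examples_known=0):
    """Rebuild the model skeleton, then copy the saved weights into it.

    Assumption: at load time the example count is unknown, so the
    skeleton falls back to the small architecture.
    """
    dest = [0.0] * choose_width(n_examples_known)
    src = saved["value"]
    if len(src) != len(dest):
        raise ValueError(
            "could not broadcast input array from shape "
            f"({len(src)}) into shape ({len(dest)})"
        )
    dest[:] = src
    return dest


if __name__ == "__main__":
    print(len(load(train(999))))   # small model saved, small skeleton: loads
    try:
        load(train(1000))          # large model saved, small skeleton: fails
    except ValueError as err:
        print(err)
```

Under those assumptions, 999 examples round-trips cleanly while 1000 examples fails on the weight copy, which matches the behaviour reported above.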

The root cause is actually a spaCy bug, which should be fixed in the next release of spacy-nightly – which is not quite nightly, because the models take some time to retrain :p.

In the meantime, if you base your textcat stuff off en_vectors_web_lg, you’ll be able to take advantage of the GloVe vectors, and both the small and large models should work properly.

Thanks again for the report!

Will be fixed in the upcoming Prodigy v0.3.0! :tada:

What is the underlying spaCy bug? Is there a way around it?

I’ve stumbled on the same problem (same exception stack), but without using Prodigy…

PS: I’m using the pt_core_news_sm model