Error when loading models trained with more than 999 samples

Hi, sorry to spam all these questions.

I found a peculiar error: once my annotated dataset reaches 1000 samples, loading the trained model fails with a dimension error.

Let’s say I have an annotated-data.jsonl with more than 1000 samples. With 999 samples the model loads fine, but not with 1000. Here are the steps to reproduce:

prodigy drop test
prodigy dataset test "yoyo"
head -n 1000 annotated-data.jsonl > small-data.jsonl
prodigy db-in test small-data.jsonl
prodigy textcat.batch-train test en_core_web_sm --output test -n 1
python -c "import spacy; spacy.load('test')"

Here is the error.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/__init__.py", line 13, in load
    return util.load_model(name, **overrides)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/util.py", line 107, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/util.py", line 138, in load_model_from_path
    return nlp.from_disk(model_path)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/language.py", line 541, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/util.py", line 483, in from_disk
    reader(path / key)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/language.py", line 537, in <lambda>
    deserializers[proc.name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "spacy/pipeline.pyx", line 170, in spacy.pipeline.BaseThincComponent.from_disk (spacy/pipeline.cpp:11298)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/spacy/util.py", line 483, in from_disk
    reader(path / key)
  File "spacy/pipeline.pyx", line 163, in spacy.pipeline.BaseThincComponent.from_disk.load_model (spacy/pipeline.cpp:10856)
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 352, in from_bytes
    copy_array(dest, param[b'value'])
  File "/Users/apewu/writelab/prodigy/lib/python3.6/site-packages/thinc/neural/util.py", line 48, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (64)

Thanks for the report.

As weird as this issue sounds, it does make sense: there’s a heuristic in spaCy’s textcat class that uses a smaller neural network architecture if only a few examples are available. The cutoff is 1000 examples. It appears the larger textcat architecture is broken. I thought I had a test for this, but apparently not!
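To illustrate the idea (just a sketch, not spaCy’s actual source), the component picks a hidden width based on the number of training examples, so models trained just below and just above the cutoff end up with differently shaped layers – which matches the (128) vs. (64) broadcast error above:

# Hypothetical sketch of the size heuristic -- names and values are
# illustrative, not spaCy's real implementation.
def pick_textcat_width(n_examples, cutoff=1000, small_width=64, large_width=128):
    # Small datasets get a narrower network; larger ones a wider one.
    return small_width if n_examples < cutoff else large_width

pick_textcat_width(999)   # -> 64
pick_textcat_width(1000)  # -> 128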


The issue only occurs when there are no pre-trained vectors loaded (as in the en_core_web_sm model). I’d been testing with a vectors model such as en_vectors_web_lg, which is why the problem didn’t show up before.
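If you want to check whether a model ships with pre-trained vectors, you can inspect the vocab’s vectors table (assuming a spaCy 2.x-style nlp.vocab.vectors API):

import spacy

nlp = spacy.load("en_core_web_sm")
# The small models ship without word vectors, so the table is empty,
# whereas en_vectors_web_lg includes the GloVe vectors.
print(nlp.vocab.vectors.shape)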

The root cause is actually a spaCy bug, which should be fixed in the next release of spacy-nightly – which is not quite nightly, because the models take some time to retrain :p.

In the meantime, if you base your textcat models on en_vectors_web_lg, you’ll be able to take advantage of the GloVe vectors, and both the small and large architectures should work properly – for example, with the commands below.
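These are the same repro commands as above, just with the vectors model as the base (assuming en_vectors_web_lg is installed; the output directory name is arbitrary):

prodigy textcat.batch-train test en_vectors_web_lg --output test-vectors -n 1
python -c "import spacy; spacy.load('test-vectors')"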

Thanks again for the report!


Will be fixed in the upcoming Prodigy v0.3.0! :tada:


What is the underlying spaCy bug? Is there a way around it?

I’ve stumbled on the same problem (same exception stack), but without using Prodigy…

PS: I’m using the pt_core_news_sm model