Custom vectors loading issue

Hi,

I'm running into an issue when I try to load custom FastText vectors into my model. These are the steps (a rough code sketch follows the list):

  1. Load the base en_core_web_sm model.
  2. Use spacy.cli.init_model.add_vectors to add FastText vectors (stored as a .vec.gz file) to the model.
  3. Disable all pipeline components except NER (i.e. the tagger & parser).
  4. Train NER.
  5. End training and restore the disabled components.
  6. Get a final evaluation score with nlp.evaluate().
  7. Save to disk with nlp.to_disk().
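
Roughly, in code (spaCy 2.2.x API; the paths, training data and hyperparameters below are placeholders, and the exact add_vectors signature may differ slightly between 2.x releases):

import random
from pathlib import Path

import spacy
from spacy.cli.init_model import add_vectors
from spacy.util import minibatch

# Placeholder training data in the usual (text, annotations) format.
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup.", {"entities": [(0, 5, "ORG")]}),
    # ... more examples ...
]

# 1. Load the base model.
nlp = spacy.load("en_core_web_sm")

# 2. Add the FastText vectors to the model's vocab.
add_vectors(nlp, Path("custom_vectors.vec.gz"), prune_vectors=-1)

# 3.-5. Disable everything except NER while training; the disabled
# components are restored when the context manager exits.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)

# 6.-7. Evaluate (on a held-out dev set in practice) and save.
scorer = nlp.evaluate(TRAIN_DATA)
print(scorer.ents_f)
nlp.to_disk("output_model")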

This all works fine. However, when I later try to re-load the model from disk, I get this error:

Traceback (most recent call last):
  File "nn_parser.pyx", line 671, in spacy.syntax.nn_parser.Parser.from_disk
  File "/opt/venv/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 375, in from_bytes
    dest = getattr(layer, name)
AttributeError: 'FunctionLayer' object has no attribute 'vectors'

...

  File "/opt/venv/lib/python3.7/site-packages/spacy/language.py", line 936, in <lambda>
    p, exclude=["vocab"]
  File "nn_parser.pyx", line 673, in spacy.syntax.nn_parser.Parser.from_disk
ValueError: [E149] Error deserializing model. Check that the config used to create the component matches the model being loaded.
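
The re-load itself is nothing special (the path is a placeholder):

import spacy

# This is the call that raises the E149 error above.
nlp = spacy.load("output_model")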

I'm guessing that loading the parser/tagger components is causing this, because somehow they expect vectors to exist where they don't. My meta.json file includes the following vector info:

  "vectors": {
    "width": 300,
    "vectors": 766082,
    "keys": 766082,
    "name": "en_model.vectors"
  },

Meanwhile, my parser's cfg contains the following:

{
  "beam_width":1,
  "beam_density":0.0,
  "beam_update_prob":1.0,
  "cnn_maxout_pieces":3,
  "nr_feature_tokens":8,
  "deprecation_fixes":{
    "vectors_name":null
  },
  "learn_tokens":false,
  "nr_class":107,
  "hidden_depth":1,
  "token_vector_width":96,
  "hidden_width":64,
  "maxout_pieces":2,
  "pretrained_vectors":null,
  "bilstm_depth":0,
  "self_attn_depth":0,
  "conv_depth":4,
  "conv_window":4,
  "embed_size":2000
}

I'm using spaCy 2.2.3. The whole process (load, train, save, re-load) works if I don't add any vectors, so it's not a version mismatch issue.

Any idea what's causing this?

This is an area of spaCy we're eager to improve (and we have something we're very keen to launch soon!). The general problem is that the system for passing config through the different components is very brittle: defaults can be inserted at various points along the path, and this leads to lots of bugs.

The specific bug here is that the parser has ended up expecting vectors, I guess because there are vectors loaded onto the nlp object. There's no conceptual reason why you shouldn't have vectors in the NER component and none in the parser; it's just that the config setting is being passed around incorrectly.

It looks to me like the culprit is the _fix_pretrained_vectors_name function in spacy.language. That function was added to correct an earlier problem with vector naming without forcing users to re-download their models, and I think it's now causing this problem.
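
If you want to confirm that's what's going on, a quick check on the nlp object just before you call to_disk is to compare the vectors on the vocab with the pretrained_vectors setting each component's config ended up with, roughly like this:

# Quick diagnostic: the vocab has vectors, but each component's cfg
# records whether it was built to use them.
print("vocab vectors:", nlp.vocab.vectors.name, nlp.vocab.vectors.shape)
for name, pipe in nlp.pipeline:
    cfg = getattr(pipe, "cfg", {})
    print(name, "pretrained_vectors =", cfg.get("pretrained_vectors"))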

You might be able to simply monkey-patch the function out, like this:

import spacy.language
# Undo this backwards compatibility hack, as it interferes with having
# some components use vectors but not others.
spacy.language._fix_pretrained_vectors_name = lambda nlp: nlp
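
Just make sure the patch runs before the spacy.load() call that re-loads your saved model, since it needs to be in place when the pipeline components are deserialized.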

That took care of it, thank you. Excited to see what's being launched soon!