Could not find function 'specialized_ner_reader' in function registry 'readers'.

Dear Prodigy Support,

I recently got the Prodigy Nightly plan and I wanted to try to use the new features of data-to-spacy (see e.g. here) in order to generate a config.cfg file to use with the new spacy 3.0.

Context

I would like to use the base model en_ner_craft_md from scispacy to train a NER model. Also, in my environment I have installed

  • spacy==3.0.5
  • scispacy==0.4.0
  • en_ner_craft_md==0.4.0
  • prodigy==1.11.0a5

Then I run

prodigy data-to-spacy \
    --lang en \
    --ner annotations15_EmmanuelleLogette_2020-09-22_raw9_Pathway \
    --eval-split 0.1 \
    --base-model en_ner_craft_md \
    --optimize accuracy \
    --verbose tmp

and I got the following output.

ℹ Using base model 'en_ner_craft_md'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 134 | Evaluation: 14 (10% split)
Training: 134 | Evaluation: 14
Labels: ner (1)
  - [ner] PATHWAY
/usr/local/lib/python3.7/dist-packages/spacy/training/iob_utils.py:142: UserWarning: [W030] Some entities could not be aligned in the text "Electrochemical potential-driven transporters (cla..." with entities "[(187, 196, 'PATHWAY'), (331, 344, 'PATHWAY'), (61...". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str,
✔ Saved 134 training examples
tmp/train.spacy
✔ Saved 14 evaluation examples
tmp/dev.spacy

============================= Generating config =============================
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/__main__.py", line 54, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 505, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/casalegn/.local/lib/python3.7/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/casalegn/.local/lib/python3.7/site-packages/prodigy/recipes/train.py", line 435, in data_to_spacy
    nlp = spacy_init_nlp(config, use_gpu=0 if gpu else -1)  # ID doesn't matter
  File "/usr/local/lib/python3.7/dist-packages/spacy/training/initialize.py", line 57, in init_nlp
    train_corpus, dev_corpus = resolve_dot_names(config, dot_names)
  File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 474, in resolve_dot_names
    result = registry.resolve(config[section])
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 723, in resolve
    config, schema=schema, overrides=overrides, validate=validate, resolve=True
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 772, in _make
    config, schema, validate=validate, overrides=overrides, resolve=resolve
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 825, in _fill
    promise_schema = cls.make_promise_schema(value, resolve=resolve)
  File "/usr/local/lib/python3.7/dist-packages/thinc/config.py", line 1016, in make_promise_schema
    func = cls.get(reg_name, func_name)
  File "/usr/local/lib/python3.7/dist-packages/spacy/util.py", line 141, in get
    ) from None
catalogue.RegistryError: [E893] Could not find function 'specialized_ner_reader' in function registry 'readers'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: prodigy.MergedCorpus.v1, prodigy.NERCorpus.v1, prodigy.ParserCorpus.v1, prodigy.TaggerCorpus.v1, prodigy.TextCatCorpus.v1, spacy.Corpus.v1, spacy.JsonlCorpus.v1, spacy.read_labels.v1, srsly.read_json.v1, srsly.read_jsonl.v1, srsly.read_msgpack.v1, srsly.read_yaml.v1

Questions

  1. Could you please help me debug the error above?
  2. I was expecting this command to produce a config.cfg file. Since the command printed
    ✔ Generated training config
    I expected to find this configuration file in the output directory. However, I could not find such a file. Do you know why?
    Maybe the message
    ℹ Using config from base model
    means that no config.cfg is going to be generated, and that I can just use the file /usr/local/lib/python3.7/dist-packages/en_ner_craft_md/en_ner_craft_md-0.4.0/config.cfg?

Thank you very much in advance for your kind help!

—Francesco

As far as I understand, specialized_ner_reader is the one defined in scispacy (see here), and I have scispacy installed, so I don't understand why Prodigy can't seem to find it.

I think the reason it wasn't saved is that the command failed: after generating the config, there are still a few steps that are performed to add the label data etc., which is also added to the config. So the config is only saved at the end. If something goes wrong before, no config can be saved.

If your goal is to train a scispaCy model from scratch with the same configuration as this one, then yes. However, if you want to keep some weights from one or more existing trained pipelines, only train some components that you have data for (and freeze others), etc., you'll need to define this in your config: Training Pipelines & Models · spaCy Usage Documentation

The data-to-spacy command will auto-generate that for you based on the base model and the components you have data for.

Are you sure the right version of scispaCy is installed and available in the same environment? It looks like the scispaCy functions are not exposed to spaCy's function registry, and none of them show up in the list of available names either.
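One quick way to check is to query spaCy's function registry directly (assuming spaCy is importable in that environment; this is plain registry introspection, not a Prodigy command):

```python
import spacy

# All reader functions currently visible to spaCy in this environment.
# If scispacy's specialized_ner_reader were registered, it would appear here.
print(sorted(spacy.registry.readers.get_all()))
```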

Dear @ines,

Thank you very much for your reply.

Regarding specialized_ner_reader:

One could have training and development corpora in a format different from the one read by the scispaCy custom reader.

Would it be possible to tell prodigy to generate a config.cfg with another reader, for example the default reader spacy.Corpus.v1, for [corpora.xyz]?

This isn't a solution to the reported issue per se but it could be a useful workaround.

Thanks.

Thank you @ines for your quick reply!

Just to clarify, that's not what I'm trying to do: I am instead trying to train a NER model by fine-tuning a base scispaCy model.

I double checked, and scispaCy as well as the specific scispaCy models are indeed installed. Could the problem be that the scispaCy authors omitted the --code argument when calling spacy package?

If that's the issue, is there any way to pass a file with code and functions to be registered when calling prodigy data-to-spacy, so that I could provide the definition of specialized_ner_reader myself?

OK, so I tried to dig a bit deeper. Even when spaCy does have access to specialized_ner_reader, the problem is that when

nlp = spacy_init_nlp(config, use_gpu=0 if gpu else -1)  # ID doesn't matter

is called in train.py at line 435 (inside the data_to_spacy function definition), the model tries to read the train.tsv and test.tsv annotated corpora that the scispaCy authors used when they trained their models.

Unfortunately, we cannot access those .tsv files. But we shouldn't need them anyway, since we don't want to train the NER model on that corpus but on our own corpus, only using the scispaCy model's pretrained weights to make training easier.

Do you have any suggestions on how we could achieve this?

Thank you in advance!

Thanks for digging into this! The config for the model includes the exact training and initialization settings used to produce the artifact (which is useful because it tells you exactly how it was created), but it also means that you can't necessarily re-run the config if you don't have the same code and data. I guess this is one of the edge cases where Prodigy can't easily be clever and generate a working config from a base model, because it can't know what's in there.

But since you're not training from scratch, you probably just want to source the components from the scispaCy model in your new config, and it's probably easier to just do it by hand.
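For example, sourcing just the NER component could look like this (a hand-written sketch; it assumes the installed pipeline package is en_ner_craft_md and that its NER component is named ner):

```ini
[components.ner]
source = "en_ner_craft_md"
```

Everything not sourced can then use the default settings, so no custom scispaCy functions end up in the config.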

Sure, you can just replace that block with any other reader – either the default one, or a custom function if you're working with a different format.
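For reference, corpora blocks using the default reader look roughly like this (a sketch based on spaCy's quickstart defaults; adjust the paths variables to your setup):

```ini
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
```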

See a similar example of how to do this here: projects/create_config.py at 9d5fce5f95ddf5f35c3370b2074b25e995525f51 · explosion/projects · GitHub

This example extends ner from en_core_web_sm but replaces most of the settings with the default config settings to avoid custom functions in the init.

Also see the discussion here about copying the vocab and tokenizer settings as part of the model in the new config: Best practice to keep the custom tokenizer and vocabulary when resuming training · Discussion #7582 · explosion/spaCy · GitHub

In the script above, for compatibility with models where the component listens to a separate tok2vec component, a replace_listeners setting should also be added to the config, but this requires a spaCy bugfix that hasn't been released yet.
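Once that's available, the setting would look something like this (a sketch, assuming a sourced ner component that listens to a shared tok2vec):

```ini
[components.ner]
source = "en_ner_craft_md"
replace_listeners = ["model.tok2vec"]
```

This copies the pipeline's tok2vec weights into the component itself, so it no longer depends on the shared listener.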
