Unable to run the train and data-to-spacy recipes for spancat on Prodigy 1.11.10

Thank you!

Ah, yeah. It looks like it may be a bug, but only for --spancat.

If I use this dataset:
annotated_news_headlines.jsonl (252.9 KB)

And run the following (I also tried en_core_web_md and en_core_web_lg, which hit the same error):

(venv) $ python -m prodigy db-in news_data annotated_news_headlines.jsonl
✔ Created dataset 'news_data' in database SQLite
✔ Imported 373 annotations to 'news_data' (session 2023-02-10_16-39-16)
in database SQLite
Found and keeping existing "answer" in 373 examples

(venv) $ python -m prodigy data-to-spacy model --spancat news_data --base-model en_core_web_sm
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: spancat (4)
✔ Saved 298 training examples
models/d2s/train.spacy
✔ Saved 74 evaluation examples
models/d2s/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 3 (inferred from data)
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/recipes/train.py", line 514, in data_to_spacy
    nlp = spacy_init_nlp(config)
  File "/opt/homebrew/lib/python3.10/site-packages/spacy/training/initialize.py", line 29, in init_nlp
    config = raw_config.interpolate()
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 196, in interpolate
    return Config().from_str(self.to_str())
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 387, in from_str
    self.interpret_config(config)
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 238, in interpret_config
    raise ConfigValidationError(desc=f"{e}") from None
confection.ConfigValidationError: 

Config validation error
Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
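
If I'm reading the error right, the auto-generated spancat block uses a Tok2VecListener whose width points at the shared tok2vec component. A rough sketch of what that part of the config presumably looks like (based on spaCy's default spancat template, not copied from the failing run):

[components.tok2vec]
source = "en_core_web_sm"

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

Because the tok2vec component is sourced from the base model, its section only contains the source entry and none of the model.encode.* options, so the ${components.tok2vec.model.encode.width} interpolation has nothing to resolve to, which matches the error above. Hard-coding the width there (96 for en_core_web_sm, if I remember correctly) would presumably avoid the error, but data-to-spacy crashes before it writes config.cfg, so that's not a practical fix here.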

But what's interesting is that using the same data for --ner works fine, even for en_core_web_sm.

(venv) $ python -m prodigy data-to-spacy models/d2s_sm --ner news_data --base-model en_core_web_sm
✔ Created output directory
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: ner (4)
✔ Saved 298 training examples
models/d2s_sm/train.spacy
✔ Saved 74 evaluation examples
models/d2s_sm/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
✔ Saving label data for component 'tagger'
models/d2s_sm/labels/tagger.json
✔ Saving label data for component 'parser'
models/d2s_sm/labels/parser.json
✔ Saving label data for component 'ner'
models/d2s_sm/labels/ner.json

============================= Finalizing export =============================
✔ Saved training config
models/d2s_sm/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train models/d2s_sm/config.cfg --paths.train models/d2s_sm/train.spacy --paths.dev models/d2s_sm/dev.spacy

By the way, trying --ner may not work for you, because your data was originally annotated with a spans recipe. This data was produced by ner recipes (so it's technically ner annotations), and ner annotations can typically also be used for --spancat training; the opposite doesn't hold, though: spans annotations can't be trained as an ner component. One reason is that spans recipes may produce overlapping spans, which an ner component can't be trained on.
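
As an illustration (a simplified, made-up example with hypothetical labels, not taken from your data), a spans recipe can happily store two spans over the same tokens:

{"text": "New York Times reporters", "spans": [{"start": 0, "end": 8, "label": "GPE"}, {"start": 0, "end": 14, "label": "ORG"}]}

Here "New York" and "New York Times" overlap, which is fine for a spancat component (spans live in doc.spans and may overlap) but can't be represented as doc.ents for an ner component.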

I agree. I'd recommend moving to spacy train.

Prodigy's prodigy train is just a wrapper around spacy train with sensible defaults.
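
For example (output paths here are just for illustration), these two routes should end up training roughly the same ner pipeline, assuming the same data and base model:

(venv) $ python -m prodigy train models/output --ner news_data --base-model en_core_web_sm

(venv) $ python -m spacy train models/d2s_sm/config.cfg --paths.train models/d2s_sm/train.spacy --paths.dev models/d2s_sm/dev.spacy --output models/output

The second command uses the files data-to-spacy exported above, so you can version the config and .spacy files and rerun training without touching the Prodigy database.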

I recently used a template project to compare the differences:

But it doesn't take advantage of one of spaCy's strengths: its custom configuration. I would start here in the spaCy docs and choose your options. Then build a config and train.
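
For instance (flag values are just an example, adjust them to your pipeline), you could generate a spancat config from spaCy's quickstart and point it at the .spacy files that the failed run above already exported:

(venv) $ python -m spacy init config spancat_config.cfg --lang en --pipeline spancat --optimize accuracy
(venv) $ python -m spacy train spancat_config.cfg --paths.train models/d2s/train.spacy --paths.dev models/d2s/dev.spacy --output models/spancat_model

You'd want to double-check that the spans_key in the config matches the key your annotations were saved under (as far as I know, both data-to-spacy and the quickstart spancat template default to "sc").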

Also, if you have specific spaCy config questions, check out spaCy's GitHub Discussions forum. There are lots of great posts there, and the spaCy core team can help answer questions.

In the meantime, I'm going to dig further into data-to-spacy with --spancat next week. I'll let you know what we figure out. Thanks for reporting the issue!