Thank you!
Ah, yeah. It looks like it may be a bug, but only for `--spancat`.

If I use this dataset:
annotated_news_headlines.jsonl (252.9 KB)

and run the command below (I also tried `en_core_web_md` and `en_core_web_lg`, which hit the same error):
```
(venv) $ python -m prodigy db-in news_data annotated_news_headlines.jsonl
✔ Created dataset 'news_data' in database SQLite
✔ Imported 373 annotations to 'news_data' (session 2023-02-10_16-39-16)
in database SQLite
Found and keeping existing "answer" in 373 examples
```
```
(venv) $ python -m prodigy data-to-spacy model --spancat news_data --base-model en_core_web_sm
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: spancat (4)
✔ Saved 298 training examples
models/d2s/train.spacy
✔ Saved 74 evaluation examples
models/d2s/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 3 (inferred from data)
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/recipes/train.py", line 514, in data_to_spacy
    nlp = spacy_init_nlp(config)
  File "/opt/homebrew/lib/python3.10/site-packages/spacy/training/initialize.py", line 29, in init_nlp
    config = raw_config.interpolate()
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 196, in interpolate
    return Config().from_str(self.to_str())
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 387, in from_str
    self.interpret_config(config)
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 238, in interpret_config
    raise ConfigValidationError(desc=f"{e}") from None
confection.ConfigValidationError:

Config validation error
Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'
```
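For context on what's failing: `confection`'s config format builds on `configparser`-style `${...}` interpolation, and "Bad value substitution" is `configparser`'s wording for a `${...}` key that can't be resolved in the config being parsed. Here's a minimal stdlib-only sketch of the same failure mode (the section and option names just mirror the error message above; this is illustrative, not Prodigy/spaCy internals):

```python
import configparser

# The generated config interpolates a key from a section that isn't present,
# so resolving the value fails.
cfg = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
cfg.read_string("""
[components.spancat.model.tok2vec]
width = ${components.tok2vec.model.encode.width}
""")

try:
    # Interpolation is resolved lazily; the dangling key fails here.
    cfg.get("components.spancat.model.tok2vec", "width")
except configparser.InterpolationMissingOptionError as err:
    print("Bad value substitution:", err)
```

In other words, `data-to-spacy` seems to emit a `spancat` config that references `components.tok2vec.model.encode.width`, but no matching `components.tok2vec` section survives into the final config.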
But what's interesting is that using the same data with `--ner` works fine, even with `en_core_web_sm`:
```
(venv) $ python -m prodigy data-to-spacy models/d2s_sm --ner news_data --base-model en_core_web_sm
✔ Created output directory
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: ner (4)
✔ Saved 298 training examples
models/d2s_sm/train.spacy
✔ Saved 74 evaluation examples
models/d2s_sm/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
✔ Saving label data for component 'tagger'
models/d2s_sm/labels/tagger.json
✔ Saving label data for component 'parser'
models/d2s_sm/labels/parser.json
✔ Saving label data for component 'ner'
models/d2s_sm/labels/ner.json

============================= Finalizing export =============================
✔ Saved training config
models/d2s_sm/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train models/d2s_sm/config.cfg --paths.train models/d2s_sm/train.spacy --paths.dev models/d2s_sm/dev.spacy
```
By the way, trying `--ner` may not work for you because your data was originally annotated with a spans recipe. My data was produced by `ner` recipes (so technically `ner` annotations), and that can typically be used for `--spancat` training, but the opposite doesn't hold: `spans` annotations can't be trained as an `ner` component. One reason is that `spans` recipes may produce overlapping entities, which a `ner` component can't be trained on.
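The overlap limitation can be seen without spaCy at all: classic NER training reduces spans to one BIO tag per token, so two spans that share a token have no valid encoding. A small illustrative sketch (plain Python, not Prodigy/spaCy internals; the helper name is made up):

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags.

    Each token can carry exactly one tag, so overlapping spans are rejected,
    which is why spans annotations can't always become ner annotations.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tags[i] != "O" for i in range(start, end)):
            raise ValueError(f"overlapping span {label} at tokens {start}:{end}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

print(spans_to_bio(4, [(0, 2, "ORG")]))
# → ['B-ORG', 'I-ORG', 'O', 'O']

# Two spans sharing token 1: no valid BIO encoding exists.
try:
    spans_to_bio(4, [(0, 2, "ORG"), (1, 3, "PER")])
except ValueError as err:
    print("rejected:", err)
```

A `spancat` component sidesteps this by scoring candidate spans independently, so overlaps are fine there.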
I agree. I'd recommend you move to `spacy train`. Prodigy's `prodigy train` is just a wrapper around `spacy train` with sensible defaults. I recently used a template project to compare the differences:

But it doesn't take advantage of one of spaCy's strengths: its custom configuration. I would start here in the spaCy docs and choose your options. Then build a config and train.
Also, if you have specific spaCy config questions, check out spaCy's GitHub Discussions forum. There are lots of great posts, and the spaCy core team can help answer questions.
In the meantime, next week I'm going to investigate the `data-to-spacy` behavior for `--spancat` further. I'll let you know what we figure out. Thanks for reporting the issue!
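If it unblocks you sooner, one possible manual workaround (a sketch only, not verified): edit the generated `config.cfg` so the `spancat` tok2vec width no longer interpolates the missing key, e.g. by hard-coding the width. The value below is only illustrative and would need to match the base model's actual tok2vec output width:

```ini
# Hypothetical edit to the generated config.cfg:
# replace the dangling ${components.tok2vec.model.encode.width}
# interpolation with a literal value.
[components.spancat.model.tok2vec]
width = 96
```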