Unable to run the train and data-to-spacy recipes for spancat on Prodigy 1.11.10

Hello. I am new to prodigy and have created a project for span categorization using:

prodigy spans.manual my_project en_core_sci_sm C:\Prodigy\Data\my_project.csv --loader csv --label RESPIRATORY,NEGATIVE

When running the train recipe:

prodigy train ./Models --spancat my_project --base-model en_core_sci_sm

I'm getting this error:

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 11 (inferred from data)
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
✘ Config validation error
Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

I am also seeing a similar error when trying the data-to-spacy command:

python -m prodigy data-to-spacy .\output --spancat my_project --base-model en_core_sci_sm

> ======================== Generating cached label data ========================
> Traceback (most recent call last):
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\runpy.py", line 194, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\prodigy\__main__.py", line 62, in <module>
>     controller = recipe(*args, use_plac=True)
>   File "cython_src\prodigy\core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\plac_core.py", line 367, in call
>     cmd, result = parser.consume(arglist)
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\plac_core.py", line 232, in consume
>     return cmd, self.func(*(args + varargs + extraopts), **kwargs)
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\prodigy\recipes\train.py", line 514, in data_to_spacy
>     nlp = spacy_init_nlp(config)
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\spacy\training\initialize.py", line 29, in init_nlp
>     config = raw_config.interpolate()
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\confection\__init__.py", line 196, in interpolate
>     return Config().from_str(self.to_str())
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\confection\__init__.py", line 387, in from_str
>     self.interpret_config(config)
>   File "C:\Users\bmosher\Anaconda3\envs\prodigy\lib\site-packages\confection\__init__.py", line 238, in interpret_config
>     raise ConfigValidationError(desc=f"{e}") from None
> confection.ConfigValidationError:
> 
> Config validation error
> Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

I verified that my text examples have a max length of 500. I'm really at a loss for how to move forward.

Here is the output of conda list:

# packages in environment at C:\Users\bmosher\Anaconda3\envs\prodigy:
#
# Name                    Version                   Build  Channel
aiofiles                  23.1.0                   pypi_0    pypi
anyio                     3.5.0            py38haa95532_0  
appdirs                   1.4.4              pyhd3eb1b0_0  
argon2-cffi               21.3.0             pyhd3eb1b0_0  
argon2-cffi-bindings      21.2.0           py38h2bbff1b_0  
asttokens                 2.0.5              pyhd3eb1b0_0  
attrs                     22.1.0           py38haa95532_0  
backcall                  0.2.0              pyhd3eb1b0_0  
beautifulsoup4            4.11.1           py38haa95532_0  
blas                      1.0                         mkl  
bleach                    4.1.0              pyhd3eb1b0_0  
blis                      0.7.9                    pypi_0    pypi
brotlipy                  0.7.0           py38h2bbff1b_1003  
ca-certificates           2023.01.10           haa95532_0  
cachetools                5.3.0                    pypi_0    pypi
catalogue                 2.0.8                    pypi_0    pypi
certifi                   2022.12.7        py38haa95532_0  
cffi                      1.15.1           py38h2bbff1b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
click                     8.1.3                    pypi_0    pypi
colorama                  0.4.6            py38haa95532_0  
comm                      0.1.2            py38haa95532_0  
confection                0.0.4                    pypi_0    pypi
conllu                    4.5.2                    pypi_0    pypi
cryptography              38.0.4           py38h21b164f_0  
cymem                     2.0.7                    pypi_0    pypi
cython                    0.29.28          py38hd77b12b_0  
debugpy                   1.5.1            py38hd77b12b_0  
decorator                 5.1.1              pyhd3eb1b0_0  
defusedxml                0.7.1              pyhd3eb1b0_0  
en-core-sci-sm            0.5.1                    pypi_0    pypi
entrypoints               0.4              py38haa95532_0  
executing                 0.8.3              pyhd3eb1b0_0  
fastapi                   0.89.1                   pypi_0    pypi
fftw                      3.3.9                h2bbff1b_1  
flit-core                 3.6.0              pyhd3eb1b0_0  
gensim                    4.2.0            py38hd77b12b_0  
h11                       0.14.0                   pypi_0    pypi
icc_rt                    2022.1.0             h6049295_2  
idna                      3.4              py38haa95532_0  
importlib_resources       5.2.0              pyhd3eb1b0_1  
intel-openmp              2021.4.0          haa95532_3556  
ipykernel                 6.19.2           py38hd4e2768_0  
ipython                   8.8.0            py38haa95532_0  
ipython_genutils          0.2.0              pyhd3eb1b0_1  
jedi                      0.18.1           py38haa95532_1  
jinja2                    3.1.2            py38haa95532_0  
joblib                    1.2.0                    pypi_0    pypi
jsonschema                4.16.0           py38haa95532_0  
jupyter_client            7.4.8            py38haa95532_0  
jupyter_core              5.1.1            py38haa95532_0  
jupyter_server            1.23.4           py38haa95532_0  
jupyterlab_pygments       0.1.2                      py_0  
langcodes                 3.3.0                    pypi_0    pypi
libffi                    3.4.2                hd77b12b_6  
libiconv                  1.16                 h2bbff1b_2  
libsodium                 1.0.18               h62dcd97_0  
libxml2                   2.9.14               h0ad7f3c_0  
libxslt                   1.1.35               h2bbff1b_0  
lxml                      4.9.1            py38h1985fb9_0  
markupsafe                2.1.1            py38h2bbff1b_0  
matplotlib-inline         0.1.6            py38haa95532_0  
mistune                   0.8.4           py38he774522_1000  
mkl                       2021.4.0           haa95532_640  
mkl-service               2.4.0            py38h2bbff1b_0  
mkl_fft                   1.3.1            py38h277e83a_0  
mkl_random                1.2.2            py38hf11a4ad_0  
murmurhash                1.0.9                    pypi_0    pypi
nbclassic                 0.4.8            py38haa95532_0  
nbclient                  0.5.13           py38haa95532_0  
nbconvert                 6.5.4            py38haa95532_0  
nbformat                  5.7.0            py38haa95532_0  
nest-asyncio              1.5.6            py38haa95532_0  
nmslib                    2.1.1                    pypi_0    pypi
notebook                  6.5.2            py38haa95532_0  
notebook-shim             0.2.2            py38haa95532_0  
numpy                     1.23.5           py38h3b20f71_0  
numpy-base                1.23.5           py38h4da318b_0  
openssl                   1.1.1s               h2bbff1b_0  
packaging                 22.0             py38haa95532_0  
pandocfilters             1.5.0              pyhd3eb1b0_0  
parso                     0.8.3              pyhd3eb1b0_0  
pathy                     0.10.1                   pypi_0    pypi
peewee                    3.15.4                   pypi_0    pypi
pickleshare               0.7.5           pyhd3eb1b0_1003  
pip                       22.3.1           py38haa95532_0  
pkgutil-resolve-name      1.3.10           py38haa95532_0  
plac                      1.1.3                    pypi_0    pypi
platformdirs              2.5.2            py38haa95532_0  
pooch                     1.4.0              pyhd3eb1b0_0  
preshed                   3.0.8                    pypi_0    pypi
prodigy                   1.11.10                  pypi_0    pypi
prometheus_client         0.14.1           py38haa95532_0  
prompt-toolkit            3.0.36           py38haa95532_0  
psutil                    5.9.0            py38h2bbff1b_0  
pure_eval                 0.2.2              pyhd3eb1b0_0  
pybind11                  2.6.1                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydantic                  1.10.4                   pypi_0    pypi
pygments                  2.11.2             pyhd3eb1b0_0  
pyjwt                     2.6.0                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyrsistent                0.18.0           py38h196d8e1_0  
pysbd                     0.3.4                    pypi_0    pypi
pysocks                   1.7.1            py38haa95532_0  
python                    3.8.16               h6244533_2  
python-dateutil           2.8.2              pyhd3eb1b0_0  
python-fastjsonschema     2.16.2           py38haa95532_0  
pywin32                   305              py38h2bbff1b_0  
pywinpty                  2.0.2            py38h5da7b33_0  
pyzmq                     23.2.0           py38hd77b12b_0  
requests                  2.28.1           py38haa95532_0  
scikit-learn              1.2.1                    pypi_0    pypi
scipy                     1.10.0           py38h321e85e_0  
scispacy                  0.5.1                    pypi_0    pypi
send2trash                1.8.0              pyhd3eb1b0_1  
setuptools                65.6.3           py38haa95532_0  
six                       1.16.0             pyhd3eb1b0_1  
smart_open                5.2.1            py38haa95532_0  
sniffio                   1.2.0            py38haa95532_1  
soupsieve                 2.3.2.post1      py38haa95532_0  
spacy                     3.4.4                    pypi_0    pypi
spacy-legacy              3.0.12                   pypi_0    pypi
spacy-loggers             1.0.4                    pypi_0    pypi
sqlite                    3.40.1               h2bbff1b_0  
srsly                     2.4.5                    pypi_0    pypi
stack_data                0.2.0              pyhd3eb1b0_0  
starlette                 0.22.0                   pypi_0    pypi
terminado                 0.17.1           py38haa95532_0  
thinc                     8.1.7                    pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tinycss2                  1.2.1            py38haa95532_0  
toolz                     0.12.0                   pypi_0    pypi
tornado                   6.2              py38h2bbff1b_0  
tqdm                      4.64.1                   pypi_0    pypi
traitlets                 5.7.1            py38haa95532_0  
typer                     0.7.0                    pypi_0    pypi
typing-extensions         4.4.0            py38haa95532_0  
typing_extensions         4.4.0            py38haa95532_0  
urllib3                   1.26.14          py38haa95532_0  
uvicorn                   0.18.3                   pypi_0    pypi
vc                        14.2                 h21ff451_1  
vs2015_runtime            14.27.29016          h5e58377_2  
wasabi                    0.10.1                   pypi_0    pypi
wcwidth                   0.2.5              pyhd3eb1b0_0  
webencodings              0.5.1                    py38_1  
websocket-client          0.58.0           py38haa95532_4  
wheel                     0.37.1             pyhd3eb1b0_0  
win_inet_pton             1.1.0            py38haa95532_0  
wincertstore              0.2              py38haa95532_2  
winpty                    0.4.3                         4  
zeromq                    4.3.4                hd77b12b_0  
zipp                      3.11.0           py38haa95532_0  
zlib                      1.2.13               h8cc25b3_0

I am grateful for any hints or suggestions.

Cheers,

Bryan Mosher

Hi @bmosher01!

Thanks for your question and welcome to the Prodigy community :wave:

First off -- thank you so much for your detailed issue. It helps a lot, and we can respond much faster when users provide good details like this.

Do you have the same problem if you remove --base-model, either when training or when converting the data with data-to-spacy?

We've recently found some potential issues with --base-model in prodigy train, and it may affect data-to-spacy too.

Just curious, can you explain your thinking of using the en_core_sci_sm model (SciSpaCy)?

Typically base models are used when you want to use their vectors in a future pipeline, so I could see using SciSpaCy with data-to-spacy if you wanted your pipeline to have SciSpaCy's vectors during training. (In theory, you could also use a vectors-only model like en_core_sci_lg instead.)
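For instance, if vectors are what you're after, one way (a sketch, not a tested recipe; en_core_sci_lg is SciSpaCy's large package, which I believe ships with vectors, but double-check your install) is to point the initialize block of your training config at a package that has vectors:

```ini
[initialize]
# Load static vectors from an installed package at training time.
# en_core_sci_lg is an assumption here; substitute any vectors package you have.
vectors = "en_core_sci_lg"
```

The same thing can be passed as a spacy train override, e.g. `--initialize.vectors en_core_sci_lg`.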

I could also see SciSpaCy helping if you wanted to use a correct or teach recipe with one of its components (say, a custom ner) and correct/teach it in Prodigy. For spans annotation, however, you'd likely be fine with a blank tokenizer.

prodigy spans.manual my_project en_core_sci_sm C:\Prodigy\Data\my_project.csv --loader csv --label RESPIRATORY,NEGATIVE

Also, for manual annotation recipes you could use essentially any English tokenizer (e.g., blank:en). But I don't think the annotations are the problem; it's training or running data-to-spacy.
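For example, the same annotation session could be started with a blank English tokenizer (a sketch of your command with only the model argument swapped; the path and labels are taken from your post):

```shell
prodigy spans.manual my_project blank:en C:\Prodigy\Data\my_project.csv --loader csv --label RESPIRATORY,NEGATIVE
```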

I'll admit I haven't used SciSpaCy before so I'll need to look more into it.

One last thing: I see you're running spaCy 3.4.4. Do you know if your SciSpaCy version works with spaCy 3.4.4? Sometimes it's hard for downstream packages to keep up with newer versions of spaCy.

Regardless, let us know if you can at least overcome this bottleneck.

Thanks so much for the response.

Running the data-to-spacy recipe without --base-model will finish without error.

prodigy data-to-spacy .\output --spancat my_project

Interestingly, using --base-model en_core_web_sm fails in the same way as en_core_sci_sm.

The most recent version of scispacy (0.5.1) is compatible with spaCy 3.4.4, so I believe this model should work here. However, I'm not able to get it working with either en_core_sci_sm or en_core_web_sm.

The thinking behind using the en_core_sci_sm model is to accelerate labeling. I'm dealing with domain-specific (healthcare) text, which has many abbreviations and unique words. An example span might be: "PT ON 3.5L NC"

I need to label my dataset from scratch and my fear is that it will be a monster without transfer learning.

In the end I was able to run the data-to-spacy command and then train a spancat model using "spacy train" instead of "prodigy train".

I think the piece I still need to figure out is how to integrate en_core_web_sm, or how to supply the vectors as you mentioned above.

Thank you again!

Bryan

Thank you!

Ah - yeah. It looks like it may be a bug but only for --spancat.

If I use this dataset:
annotated_news_headlines.jsonl (252.9 KB)

And run the following (en_core_web_md and en_core_web_lg produce the same error):

(venv) $ python -m prodigy db-in news_data annotated_news_headlines.jsonl
✔ Created dataset 'news_data' in database SQLite
✔ Imported 373 annotations to 'news_data' (session 2023-02-10_16-39-16)
in database SQLite
Found and keeping existing "answer" in 373 examples

(venv) $ python -m prodigy data-to-spacy model --spancat news_data --base-model en_core_web_sm
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: spancat (4)
✔ Saved 298 training examples
models/d2s/train.spacy
✔ Saved 74 evaluation examples
models/d2s/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 3 (inferred from data)
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/opt/homebrew/lib/python3.10/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/prodigy/recipes/train.py", line 514, in data_to_spacy
    nlp = spacy_init_nlp(config)
  File "/opt/homebrew/lib/python3.10/site-packages/spacy/training/initialize.py", line 29, in init_nlp
    config = raw_config.interpolate()
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 196, in interpolate
    return Config().from_str(self.to_str())
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 387, in from_str
    self.interpret_config(config)
  File "/opt/homebrew/lib/python3.10/site-packages/confection/__init__.py", line 238, in interpret_config
    raise ConfigValidationError(desc=f"{e}") from None
confection.ConfigValidationError: 

Config validation error
Bad value substitution: option 'width' in section 'components.spancat.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

But what's interesting is that using the same data for --ner works fine, even for en_core_web_sm.

(venv) $ python -m prodigy data-to-spacy models/d2s_sm --ner news_data --base-model en_core_web_sm
✔ Created output directory
ℹ Using base model 'en_core_web_sm'

============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 298 | Evaluation: 74 (20% split)
Training: 298 | Evaluation: 74
Labels: ner (4)
✔ Saved 298 training examples
models/d2s_sm/train.spacy
✔ Saved 74 evaluation examples
models/d2s_sm/dev.spacy

============================= Generating config =============================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

======================== Generating cached label data ========================
✔ Saving label data for component 'tagger'
models/d2s_sm/labels/tagger.json
✔ Saving label data for component 'parser'
models/d2s_sm/labels/parser.json
✔ Saving label data for component 'ner'
models/d2s_sm/labels/ner.json

============================= Finalizing export =============================
✔ Saved training config
models/d2s_sm/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train models/d2s_sm/config.cfg --paths.train models/d2s_sm/train.spacy --paths.dev models/d2s_sm/dev.spacy

By the way, trying --ner may not work for you, because your data was originally annotated with a spans recipe. The data in my example was produced by ner recipes (so technically ner annotations), which can typically be trained with --spancat, but the opposite doesn't hold: spans annotations can't be trained as a ner component. One reason is that spans recipes may produce overlapping spans, which a ner component can't be trained on.
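To illustrate the overlap point, here's a minimal spaCy sketch (the sentence and label are made up for illustration): span groups, which spancat trains on, happily hold overlapping spans, while doc.ents, which ner trains on, rejects them:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("patient on nasal cannula oxygen")

# Overlapping spans are fine in a span group (what spancat uses):
doc.spans["sc"] = [
    Span(doc, 0, 3, label="RESPIRATORY"),  # "patient on nasal"
    Span(doc, 2, 5, label="RESPIRATORY"),  # "nasal cannula oxygen" (overlaps above)
]

# But doc.ents (what ner uses) can't hold overlapping spans:
try:
    doc.ents = [
        Span(doc, 0, 3, label="RESPIRATORY"),
        Span(doc, 2, 5, label="RESPIRATORY"),
    ]
    overlap_allowed = True
except ValueError:
    overlap_allowed = False

print(len(doc.spans["sc"]), overlap_allowed)  # 2 False
```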

I agree. I'd recommend you move to spacy train.

prodigy train is just a wrapper around spacy train with sensible defaults.
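In other words, once data-to-spacy has exported your data, the training itself can be run directly with spaCy. A sketch, assuming the .\output directory from your earlier data-to-spacy command (which contains config.cfg, train.spacy, and dev.spacy):

```shell
python -m spacy train .\output\config.cfg --paths.train .\output\train.spacy --paths.dev .\output\dev.spacy --output .\Models
```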

I recently used a template project to compare the differences.

But prodigy train doesn't take advantage of one of spaCy's strengths: its custom configuration. I would start with the training section of the spaCy docs and choose your options, then build a config and train.
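For example, one way to bootstrap a custom config (a sketch; I believe recent spaCy versions support spancat in the init config quickstart, but verify against your version) is to let spaCy generate a starter config and then edit it before training:

```shell
python -m spacy init config spancat_config.cfg --lang en --pipeline spancat
python -m spacy train spancat_config.cfg --paths.train .\output\train.spacy --paths.dev .\output\dev.spacy
```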

Also, if you have specific spaCy config questions, check out spaCy's GitHub Discussions forum. Lots of great posts and the spaCy core team can help answer questions.

In the meantime, next week I'm going to investigate data-to-spacy with --spancat further. I'll let you know what we figure out. Thanks for reporting the issue!
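If you want to experiment before a fix lands, one thing you could try (purely a sketch, not verified against your exact config) is editing the exported config so the spancat listener hard-codes its width instead of using the failing interpolation. The small English pipelines use a tok2vec width of 96, but check your base model's config to be sure:

```ini
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
# Hard-coded in place of ${components.tok2vec.model.encode.width},
# which is the interpolation that fails to resolve.
width = 96
upstream = "tok2vec"
```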

@bmosher01 sorry for the delay. We fixed this bug in our recent release of Prodigy v1.11.12. Let us know if you have any further issues.