unable to run prodigy train command

When I run the command

```
python -m prodigy train output -tc msg.cat -L -m zh_core_web_lg -c config.cfg
```

it ends up with:

```
========================= Generating Prodigy config =========================
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
[2023-04-03 00:42:36,617] [DEBUG] Replacing listeners of component 'tagger'
[2023-04-03 00:42:40,830] [DEBUG] Replacing listeners of component 'parser'
[2023-04-03 00:42:49,143] [DEBUG] Replacing listeners of component 'parser'
[2023-04-03 00:42:49,274] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
• [textcat] Training: 9 | Evaluation: 2 (20% split)
  Training: 9 | Evaluation: 2
  Labels: textcat (3)
• [textcat] Attendance, Comment, Profile
[2023-04-03 00:42:49,315] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner', 'textcat']
[2023-04-03 00:42:49,316] [INFO] Resuming training for: ['tok2vec']
[2023-04-03 00:42:49,327] [INFO] Created vocabulary
[2023-04-03 00:42:51,550] [INFO] Added vectors: zh_core_web_lg
[2023-04-03 00:42:52,755] [INFO] Finished initializing nlp object
```
```
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\recipes\train.py", line 289, in train
    return _train(
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\recipes\train.py", line 201, in _train
    nlp = spacy_init_nlp(config, use_gpu=gpu_id)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\training\initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\language.py", line 1301, in initialize
    self.tokenizer.initialize(get_examples, nlp=self, **tok_settings)  # type: ignore[union-attr]
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\lang\zh\__init__.py", line 88, in initialize
    self.pkuseg_seg = try_pkuseg_import(
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\lang\zh\__init__.py", line 322, in try_pkuseg_import
    return spacy_pkuseg.pkuseg(pkuseg_model, user_dict=pkuseg_user_dict)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy_pkuseg\__init__.py", line 234, in __init__
    self.feature_extractor = FeatureExtractor.load()
  File "spacy_pkuseg\feature_extractor.pyx", line 591, in spacy_pkuseg.feature_extractor.FeatureExtractor.load
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\ntpath.py", line 78, in join
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
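For context on the final frame: no pkuseg model path was configured, so spacy_pkuseg ends up joining a None model directory with a data file name, and os.path.join rejects None. A minimal sketch of that failing call (the file name here is illustrative, not the actual one spacy_pkuseg uses):

```python
import ntpath  # the Windows flavor of os.path, as in the traceback

# spacy_pkuseg loads its feature extractor from <model_dir>\<data file>,
# but model_dir is None when no pkuseg model was configured, so the
# path join fails before anything is read from disk.
model_dir = None
try:
    ntpath.join(model_dir, "features.pkl")  # illustrative file name
except TypeError as err:
    print(err)  # expected str, bytes or os.PathLike object, not NoneType
```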

It looks like spaCy's initialization and pkuseg defaults make it tricky to start from zh_core_web_lg as a base model.

As a workaround, add this to the [initialize.tokenizer] block in your config.cfg:

```
[initialize.tokenizer]
pkuseg_model = "spacy_ontonotes"
pkuseg_user_dict = "default"
```

This is the same pkuseg model that zh_core_web_lg uses.


It works! Just wondering: if I use a language other than en, will I encounter this issue?

I think zh with pkuseg is the only place where you'll run into this particular problem with [initialize.tokenizer]. Most languages use the rule-based tokenizer, and you can use spacy.copy_from_base_model.v1 to copy the tokenizer from a base model in the [initialize] block, with the default settings otherwise.
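For reference, this is roughly what that looks like in config.cfg (a sketch; swap in your own base model name for en_core_web_sm):

```
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_web_sm"
vocab = "en_core_web_sm"
```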

We should probably consider changing the defaults for zh here, or splitting it into three separate tokenizers rather than trying to support every segmenter in spacy.zh.ChineseTokenizer.