unable to run prodigy train command

When I run the command

python -m prodigy train output -tc msg.cat -L -m zh_core_web_lg -c config.cfg

it ends with the following output and error:

========================= Generating Prodigy config =========================
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
[2023-04-03 00:42:36,617] [DEBUG] Replacing listeners of component 'tagger'
[2023-04-03 00:42:40,830] [DEBUG] Replacing listeners of component 'parser'
[2023-04-03 00:42:49,143] [DEBUG] Replacing listeners of component 'parser'
[2023-04-03 00:42:49,274] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components

  • [textcat] Training: 9 | Evaluation: 2 (20% split)
    Training: 9 | Evaluation: 2
    Labels: textcat (3)
  • [textcat] Attendance, Comment, Profile
[2023-04-03 00:42:49,315] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'ner', 'textcat']
[2023-04-03 00:42:49,316] [INFO] Resuming training for: ['tok2vec']
[2023-04-03 00:42:49,327] [INFO] Created vocabulary
[2023-04-03 00:42:51,550] [INFO] Added vectors: zh_core_web_lg
[2023-04-03 00:42:52,755] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\recipes\train.py", line 289, in train
    return _train(
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\prodigy\recipes\train.py", line 201, in _train
    nlp = spacy_init_nlp(config, use_gpu=gpu_id)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\training\initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\language.py", line 1301, in initialize
    self.tokenizer.initialize(get_examples, nlp=self, **tok_settings)  # type: ignore[union-attr]
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\lang\zh\__init__.py", line 88, in initialize
    self.pkuseg_seg = try_pkuseg_import(
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy\lang\zh\__init__.py", line 322, in try_pkuseg_import
    return spacy_pkuseg.pkuseg(pkuseg_model, user_dict=pkuseg_user_dict)
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\site-packages\spacy_pkuseg\__init__.py", line 234, in __init__
    self.feature_extractor = FeatureExtractor.load()
  File "spacy_pkuseg\feature_extractor.pyx", line 591, in spacy_pkuseg.feature_extractor.FeatureExtractor.load
  File "C:\Users\Admin\anaconda3\envs\telegram\lib\ntpath.py", line 78, in join
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not NoneType

It looks like spaCy's initialization defaults for pkuseg make it tricky to start from zh_core_web_lg as a base model.

As a workaround, use this in the [initialize] block in your config.cfg:

[initialize.tokenizer]
pkuseg_model = "spacy_ontonotes"
pkuseg_user_dict = "default"

This is the same pkuseg model as used in zh_core_web_lg.
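
If you'd like to double-check that on your end, here's a quick sketch (assuming zh_core_web_lg is installed in the same environment); the attribute and config lookups are standard spaCy, though what the initialize section shows can depend on how the pipeline was packaged:

import spacy

# Load the base model and inspect its Chinese tokenizer settings.
nlp = spacy.load("zh_core_web_lg")
print(nlp.tokenizer.segmenter)                    # expected: "pkuseg"
print(nlp.config["initialize"].get("tokenizer"))  # pkuseg_model / pkuseg_user_dict, if present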


It works! Just wondering: if I use a language other than English, will I encounter this issue?

I think zh with pkuseg is the only place where you'll run into this particular problem with [initialize.tokenizer]. Most languages use the rule-based tokenizer, and for those you can use spacy.copy_from_base_model.v1 in the [initialize] block to copy the tokenizer from a base model with the default settings.
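
For reference, here's a minimal Python sketch of that callback, using en_core_web_sm as an illustrative base model (in config.cfg the same thing goes under [initialize.before_init] with @callbacks = "spacy.copy_from_base_model.v1"):

import spacy
from spacy.util import registry

nlp = spacy.blank("en")

# Resolve the registered callback and point it at the base model to copy from.
make_copy = registry.callbacks.get("spacy.copy_from_base_model.v1")
copy_from_base = make_copy(tokenizer="en_core_web_sm", vocab="en_core_web_sm")

# spaCy normally runs this for you before nlp.initialize(); calling it directly
# copies the tokenizer and vocab settings from the base model onto nlp.
copy_from_base(nlp)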

We should probably consider changing the defaults for zh for this, or splitting the tokenizers into three separate tokenizers rather than trying to support them all in spacy.zh.ChineseTokenizer.