I have created a dataset and now want to train a binary classifier.
export LABELS=just_one_label
prodigy textcat.teach ${DATA_SET_NAME} ${MODEL} ./${DATA_FILE} --label ${LABELS} --patterns ${PATTERNS_FILE}
Looking at the dataset in the database (and dropping the ignored examples), I can see:
>> out['answer'].value_counts()
reject 640
accept 474
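For reference, the counts above come from tallying the `answer` field of the stored annotations after filtering out the ignores. A minimal stdlib sketch of that tally (the records here are hypothetical examples, not my actual data) is:

```python
from collections import Counter

# Hypothetical annotation records in the shape Prodigy stores them;
# the real ones come from the dataset in the Prodigy database.
annotations = [
    {"text": "example a", "answer": "accept"},
    {"text": "example b", "answer": "reject"},
    {"text": "example c", "answer": "ignore"},
    {"text": "example d", "answer": "reject"},
]

# Drop the ignores, then count the remaining answers.
kept = [a for a in annotations if a["answer"] != "ignore"]
counts = Counter(a["answer"] for a in kept)
print(counts)  # Counter({'reject': 2, 'accept': 1})
```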
Then I run the train recipe.
prodigy train ${MODEL_OUTPUT_DIR} --textcat ds_name_1,ds_name_2
And I get the following stack trace because the number of labels is less than 2:
========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config
=========================== Initializing pipeline ===========================
[2022-09-01 12:02:23,865] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
- [textcat] Training: 983 | Evaluation: 239 (20% split)
Training: 712 | Evaluation: 177
Labels: textcat (1)
[2022-09-01 12:02:24,585] [INFO] Pipeline: ['textcat']
[2022-09-01 12:02:24,588] [INFO] Created vocabulary
[2022-09-01 12:02:24,589] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "/home/aia/.conda/envs/prodigy/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/aia/.conda/envs/prodigy/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/train.py", line 278, in train
return _train(
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/train.py", line 190, in _train
nlp = spacy_init_nlp(config, use_gpu=gpu_id)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/training/initialize.py", line 84, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/language.py", line 1317, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/pipeline/textcat.py", line 379, in initialize
raise ValueError(Errors.E867)
ValueError: [E867] The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label.
What am I missing here?
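For context, my reading of the error text is that the exclusive `textcat` component wants the binary accept/reject decision encoded as two mutually exclusive labels (LABEL / NOT_LABEL). A sketch of the mapping I understand it to mean (the helper name is mine, and `just_one_label` mirrors the label from my teach command above):

```python
# Map a binary accept/reject answer onto two mutually exclusive
# categories, as the E867 message suggests (LABEL / NOT_LABEL).
def to_cats(answer, label="just_one_label"):
    accepted = answer == "accept"
    return {label: float(accepted), f"NOT_{label}": float(not accepted)}

print(to_cats("accept"))  # {'just_one_label': 1.0, 'NOT_just_one_label': 0.0}
print(to_cats("reject"))  # {'just_one_label': 0.0, 'NOT_just_one_label': 1.0}
```

The alternative the message mentions, keeping a single label with the `textcat_multilabel` component, would presumably avoid the remapping entirely.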