Textcat - teach to train.

I have created a dataset and now want to train a binary classifier:

export LABELS=just_one_label
prodigy textcat.teach ${DATA_SET_NAME} ${MODEL} ./${DATA_FILE} --label ${LABELS} --patterns ${PATTERNS_FILE}

Looking at the annotations in the database (inside meta) and dropping the ignored answers, I can see:

>> out['answer'].value_counts()
reject    640
accept    474

Then I run the train recipe.

prodigy train ${MODEL_OUTPUT_DIR} --textcat ds_name_1,ds_name_2

And I get the following stack trace, apparently because the number of categories is less than 2:

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-09-01 12:02:23,865] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
  - [textcat] Training: 983 | Evaluation: 239 (20% split)
Training: 712 | Evaluation: 177
Labels: textcat (1)
[2022-09-01 12:02:24,585] [INFO] Pipeline: ['textcat']
[2022-09-01 12:02:24,588] [INFO] Created vocabulary
[2022-09-01 12:02:24,589] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/train.py", line 278, in train
    return _train(
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/prodigy/recipes/train.py", line 190, in _train
    nlp = spacy_init_nlp(config, use_gpu=gpu_id)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/language.py", line 1317, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/home/aia/.conda/envs/prodigy/lib/python3.9/site-packages/spacy/pipeline/textcat.py", line 379, in initialize
    raise ValueError(Errors.E867)
ValueError: [E867] The 'textcat' component requires at least two labels because it uses mutually exclusive classes where exactly one label is True for each doc. For binary classification tasks, you can use two labels with 'textcat' (LABEL / NOT_LABEL) or alternatively, you can use the 'textcat_multilabel' component with one label.

What am I missing here?

hi @jdewsnip!

Does training work if you run it with --textcat-multilabel instead of --textcat? This is a known pain point that comes from the design (namely, translating spaCy's components into Prodigy). As the error message notes, for binary tasks we typically recommend using --textcat-multilabel.
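
To see why the error appears even though the dataset has both accepts and rejects: the component counts distinct labels, not distinct answers, and the log above reports "Labels: textcat (1)". A minimal sketch of that check, assuming the annotations have been exported to JSONL (for example with prodigy db-out) and that each task carries an "answer" field and a single "label" field; the function name is made up for illustration:

```python
import json
from collections import Counter


def summarize_annotations(lines):
    """Count answers and collect distinct labels from JSONL lines.

    Assumes each line is one task with an "answer" key and,
    for binary textcat tasks, a top-level "label" key.
    """
    answers = Counter()
    labels = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        answers[task.get("answer")] += 1
        if "label" in task:
            labels.add(task["label"])
    return answers, labels
```

If this returns two answer values but only one label, the data is binary in the accept/reject sense while still having a single category, which is the situation the E867 message describes.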

Alternatively, if you do want to use --textcat, I've written a script that converts the labels so they work with it:
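
The script itself is not reproduced in this thread. As a rough idea of what such a conversion could look like (not the original script), the sketch below turns each binary accept/reject task into a choice-style task with two mutually exclusive labels, LABEL and NOT_LABEL, as the error message suggests. It assumes JSONL input with "text" and "answer" fields, and it assumes that the train recipe accepts tasks with "options" and "accept" keys for --textcat; the function names and the NOT_ prefix are illustrative choices.

```python
import json


def to_exclusive(task, label, not_label):
    """Convert one binary accept/reject task to a two-label choice task.

    Returns None for answers other than accept/reject (e.g. ignore).
    """
    answer = task.get("answer")
    if answer not in ("accept", "reject"):
        return None
    # An accepted task gets LABEL, a rejected one gets NOT_LABEL.
    chosen = label if answer == "accept" else not_label
    return {
        "text": task["text"],
        "options": [
            {"id": label, "text": label},
            {"id": not_label, "text": not_label},
        ],
        "accept": [chosen],
        # Both outcomes are now "accepted" decisions about which label applies.
        "answer": "accept",
    }


def convert_lines(lines, label, not_label=None):
    """Convert JSONL lines of binary tasks; returns a list of JSONL strings."""
    not_label = not_label or f"NOT_{label}"
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = to_exclusive(json.loads(line), label, not_label)
        if task is not None:
            converted.append(json.dumps(task))
    return converted
```

One way to use this would be to export the dataset with prodigy db-out, run the lines through convert_lines, write the result to a new file, and load that into a fresh dataset with prodigy db-in before training with --textcat.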

I can understand this is a bit confusing (why use textcat-multilabel for a binary classifier?). Ines describes the design balancing act:

I hope this helps and let us know if you have further questions!

ah got it ... if i use --textcat-multilabel for binary it works.