Converting outside annotations into the proper format for textcat_multilabel in spaCy v3

I've posted this on Stack Overflow with no success yet, so I'm trying here:

I'm trying to convert data from a CSV file into a DocBin to train a model with a textcat_multilabel component.
Before the doc is added to the DocBin and serialized, doc.cats looks like this:

{'Santé': 1, 'Économie': 0, 'Infrastructure': 0, 'Politique fédérale': 0, 'Politique provinciale': 1, 'Politique municipale': 0, 'Éducation': 0, 'Faits divers': 0, 'Culture': 0}

Full error message when running the spacy train CLI command:

ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-08-18 06:09:46,242] [INFO] Set up nlp object from config
[2021-08-18 06:09:46,259] [INFO] Pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
[2021-08-18 06:09:46,266] [INFO] Created vocabulary
[2021-08-18 06:09:50,649] [INFO] Added vectors: fr_core_news_lg
[2021-08-18 06:09:56,557] [INFO] Finished initializing nlp object
[2021-08-18 06:10:07,714] [INFO] Initialized pipeline components: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
ℹ Initial learn rate: 0.001
---  ------  ------------  -------------  --------  -----------  ----------  ------  ------  ------  -------  -------  -------  ------
Traceback (most recent call last):
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/", line 281, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/", line 1389, in evaluate
    results = scorer.score(examples)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/", line 135, in score
    scores.update(component.score(examples, **self.cfg))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/pipeline/", line 179, in score
    return Scorer.score_cats(
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/", line 465, in score_cats
    auc_per_type[label].score_set(pred_score, gold_score)
KeyError: 'Politique fédérale'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/", line 4, in <module>
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/cli/", line 69, in setup_cli
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/typer/", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/cli/", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/", line 122, in train
    raise e
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/", line 226, in train_while_improving
    score, other_scores = evaluate()
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/", line 283, in evaluate
    raise KeyError(Errors.E900.format(pipeline=nlp.pipe_names)) from e
KeyError: "[E900] Could not run the full pipeline for evaluation. If you specified frozen components, make sure they were already initialized and trained. Full pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']"

Output when running the spacy debug data CLI command:

====================== Text Classification (Multilabel) ======================
ℹ Text Classification: 30 label(s)
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'v', 'F', 'm', 'f',
'É', 'l', 'c', 'q', 'o', ']', 'u', 'I', 'P', 'r', 'a', 'D', 'é', 'S', 't', ',',
'M', ' ', 's', ''', 'd', 'i', 'p', 'e', 'n', '['.

Looking at the list of labels in that output, I'm pretty sure my problem is the way I'm formatting the dictionary I use to set doc.cats, but I can't work out the proper format. I'm sure it's somewhere in the documentation, but I can't seem to find it and feel a bit silly...

I also tried converting the data to the Prodigy .jsonl format and running db-in followed by data-to-spacy, but I end up with the same problem. Sample line from the .jsonl file:

{"text":"Fran\u00e7ois Legault doit tenir une autre conf\u00e9rence de presse ce soir \u00e0 17 h","options":[{"id":"Sant\u00e9","text":"Sant\u00e9"},{"id":"Politique_Provinciale","text":"Politique_Provinciale"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":["Politique_Provinciale"],"answer":"accept"}

Any help would be amazing.

Are you sure you're converting your labels correctly? Maybe some entries from your CSV aren't parsed correctly? From the warning message, it looks like you might have ended up with a string representation of a list somewhere, so you end up with single-character labels like 'É', 'c', '[' and ']'.

That usually happens if you iterate over a string like '["Économie"]' instead of an actual list, so you get one entry for every character.
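Here's a minimal sketch of what that looks like and one way to guard against it. It assumes the CSV cell holds the Python-style string form of a list; the helper names `parse_labels` and `make_cats` and the sample `raw_cell` value are made up for illustration, and the label set is copied from the doc.cats example in the question. If your CSV stores labels differently (for example separated by semicolons), the parsing step would need to change.

```python
import ast

# Label set copied from the doc.cats example in the question.
ALL_LABELS = [
    "Santé", "Économie", "Infrastructure", "Politique fédérale",
    "Politique provinciale", "Politique municipale", "Éducation",
    "Faits divers", "Culture",
]


def parse_labels(cell):
    """Turn a CSV cell into a real list of label strings (hypothetical helper)."""
    if isinstance(cell, (list, tuple)):
        return list(cell)
    try:
        # Safely evaluate text such as "['Santé', 'Culture']" into a list.
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the whole cell as a single label.
        return [cell.strip()]
    if isinstance(parsed, str):
        return [parsed]
    return [str(item) for item in parsed]


def make_cats(cell, all_labels=ALL_LABELS):
    """Build a dict with one entry per known label (hypothetical helper)."""
    active = set(parse_labels(cell))
    return {label: 1.0 if label in active else 0.0 for label in all_labels}


# Assumed shape of a raw CSV cell: a string, not a list.
raw_cell = "['Santé', 'Politique provinciale']"

# The bug: iterating over the string yields one "label" per character,
# including the brackets, quotes, comma and space.
buggy_labels = sorted(set(raw_cell))

# The fix: parse the cell first, then build the dict over the full label set.
cats = make_cats(raw_cell)
```

The resulting `cats` dict has the same shape as the doc.cats example in the question, with every label present in every doc, so it can be assigned to doc.cats before the doc is added to the DocBin.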

Btw, for general spaCy usage questions, you can also post them on the spaCy discussions forum on GitHub (explosion/spaCy).