Converting outside annotations in proper format for textcat_muultilabel in spacy v3

tavernierdetk · August 19, 2021, 7:37pm

I've posted this on stack overflow (https://stackoverflow.com/questions/68832693/training-a-textcat-multilabel-model-in-spacy-v3-data-format-problem) with no success yet, so I'm trying here:

I'm trying to convert data from a csv into a DocBin to train a model with a textcat_multilabel component.
doc.cats, before being added to the DocBin and serialized looks like this:

{'Santé': 1, 'Économie': 0, 'Infrastructure': 0, 'Politique fédérale': 0, 'Politique provinciale': 1, 'Politique municipale': 0, 'Éducation': 0, 'Faits divers': 0, 'Culture': 0}

Full error message when running the spacy train CLI command:

ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-08-18 06:09:46,242] [INFO] Set up nlp object from config
[2021-08-18 06:09:46,259] [INFO] Pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
[2021-08-18 06:09:46,266] [INFO] Created vocabulary
[2021-08-18 06:09:50,649] [INFO] Added vectors: fr_core_news_lg
[2021-08-18 06:09:56,557] [INFO] Finished initializing nlp object
[2021-08-18 06:10:07,714] [INFO] Initialized pipeline components: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTC...  LOSS NER  LOSS PARSER  CATS_SCORE  ENTS_F  ENTS_P  ENTS_R  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -------------  --------  -----------  ----------  ------  ------  ------  -------  -------  -------  ------
Traceback (most recent call last):
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/loop.py", line 281, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/language.py", line 1389, in evaluate
    results = scorer.score(examples)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/scorer.py", line 135, in score
    scores.update(component.score(examples, **self.cfg))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/pipeline/textcat_multilabel.py", line 179, in score
    return Scorer.score_cats(
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/scorer.py", line 465, in score_cats
    auc_per_type[label].score_set(pred_score, gold_score)
KeyError: 'Politique fédérale'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/loop.py", line 226, in train_while_improving
    score, other_scores = evaluate()
  File "/Users/alex/PycharmProjects/Ecosysteme_NLP_Sentinelle/python39clean/lib/python3.9/site-packages/spacy/training/loop.py", line 283, in evaluate
    raise KeyError(Errors.E900.format(pipeline=nlp.pipe_names)) from e
KeyError: "[E900] Could not run the full pipeline for evaluation. If you specified frozen components, make sure they were already initialized and trained. Full pipeline: ['tok2vec', 'textcat_multilabel', 'ner', 'parser']"

Output when running spacy debug data CLI command:

====================== Text Classification (Multilabel) ======================
ℹ Text Classification: 30 label(s)
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'v', 'F', 'm', 'f',
'É', 'l', 'c', 'q', 'o', ']', 'u', 'I', 'P', 'r', 'a', 'D', 'é', 'S', 't', ',',
'M', ' ', 's', ''', 'd', 'i', 'p', 'e', 'n', '['.

Looking at the list of labels in that output, I'm pretty sure my problem is with the way I'm formatting my dictionary used to set doc.cats but I can't seem to find the proper way to format it. I'm sure it's somewhere in the documentation, but I can't seem to find it and feel a bit silly...

I also tried by formatting to the Prodigy .jsonl format, and using the db-in then data-to-spacy order of commands, but I end up with the same problem. Sample line from the .jsonl file:

{"text":"Fran\u00e7ois Legault doit tenir une autre conf\u00e9rence de presse ce soir \u00e0 17 h","options":[{"id":"Sant\u00e9","text":"Sant\u00e9"},{"id":"Politique_Provinciale","text":"Politique_Provinciale"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":["Politique_Provinciale"],"answer":"accept"}

Any help would be amazing

ines · August 19, 2021, 10:30pm

Are you sure you're converting your labels correctly? Maybe some entries from your CSV aren't parsed correctly? From the warning message, it looks like you might have ended up with a string representation of a list somewhere, so you end up with labels like:

That usually happens if you iterate over a string like '["Économie"]' instead of an actual list, so you get on entry for every letter.

Btw, for general spaCy usage questions, you can also post them on the discussions forum: explosion/spaCy · Discussions · GitHub

Topic		Replies	Views
training data format for multiclass textcat Getting Started usage , textcat	7	1461	August 29, 2022
Don't understand the label files from data-to-spacy usage , textcat	2	506	February 5, 2022
Unable to train textcat model using en_core_web_md as a base model textcat	11	1632	May 2, 2023
Exporting dataset from prodigy and train textcat in spaCy v3 textcat , done , spacy	6	890	August 12, 2021
textcat_multilabel with only some labels annotated for some examples	5	375	June 14, 2022

Converting outside annotations in proper format for textcat_muultilabel in spacy v3

Related topics