textcat.correct Exclusive Categories

I have a dataset that was annotated with non-exclusive, multilabel cats. The config I used during training was initialized from the spaCy quickstart widget with the settings: textcat, GPU (transformer), and accuracy. Training had no issues, and the saved pipeline config shows false for every exclusive* parameter. Somehow textcat.correct is inferring that the cats are exclusive, and I can't trace the issue beyond the infer_exclusive function on line 204 of textcat.py.

Could you share the config you used, along with the command you used to train the model? Did you train with Prodigy or with spaCy directly? Could you also share your Prodigy and spaCy versions? It'd also help if you could share the output of the spacy info command.

Also, could you run the following code and share the results?

import spacy

nlp = spacy.load("path/to/trained/model")

# Depending on how the pipeline was set up, the component may be registered
# as "textcat" or as "textcat_multilabel", so one of these calls may fail.
pipe_config = nlp.get_pipe_config("textcat")
print(pipe_config)

pipe_config = nlp.get_pipe_config("textcat_multilabel")
print(pipe_config)
Here's the config I used:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","textcat_multilabel"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

The command I used to train was roughly:

pdgy train ./model/tcm_qualifications -tcm tcm_qualdata -c ./config/tcm_config.cfg -g 0

I have been using Prodigy to train. This is the spaCy info:

spaCy version    3.5.0
Location         C:\Users\kyleb\hazon\lib\site-packages\spacy
Platform         Windows-10-10.0.22621-SP0     
Python version   3.9.13
Pipelines        en_core_web_lg (3.5.0), en_core_web_md (3.5.0), en_core_web_sm (3.5.0), en_core_web_trf (3.5.0)

Getting the pipe_config for "textcat" fails (the pipeline only has "textcat_multilabel"), but this is the output for "textcat_multilabel":

{'factory': 'textcat_multilabel', 'model': {'@architectures': 'spacy.TextCatEnsemble.v2', 'nO': None, 'linear_model': {'@architectures': 'spacy.TextCatBOW.v2', 'exclusive_classes': False, 'ngram_size': 1, 'no_output_layer': False, 'nO': None}, 'tok2vec': {'@architectures': 'spacy-transformers.TransformerListener.v1', 'grad_factor': 1.0, 'pooling': {'@layers': 'reduce_mean.v1'}, 'upstream': '*'}}, 'scorer': {'@scorers': 'spacy.textcat_multilabel_scorer.v2'}, 'threshold': 0.5}

That makes sense, and I can see the "exclusive_classes" setting, which is what I'd expect.
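If you just want that one flag rather than the whole config, here's a small sketch (reusing the same model path as before) that drills into the nested dict directly:

import spacy

nlp = spacy.load("path/to/trained/model")
cfg = nlp.get_pipe_config("textcat_multilabel")

# The exclusivity flag lives on the linear_model sub-layer of the ensemble
# architecture; per the output above, this should print False.
print(cfg["model"]["linear_model"]["exclusive_classes"])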

I figured I'd try to follow in your footsteps with a basic example first. I started with some example texts, which I annotated with a few labels:

python -m prodigy textcat.manual issue-6374 examples.jsonl --label science,sports,positive,negative

The labels were assigned randomly, and afterwards I trained a model:

python -m prodigy train model-out -tcm issue-6374 -c config.cfg

I'm using a different config because I don't have a GPU. This is the one I used:

[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","textcat_multilabel"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Once training is done, I give textcat.correct a spin. The terminal output reports four labels and says the categories are non-exclusive:

> python -m prodigy textcat.correct issue-6374 model-out/model-best examples.jsonl --label science,sports,positive,negative

Using 4 label(s): science, sports, positive, negative

ℹ Annotating non-exclusive categories based on 'textcat_multilabel' component config
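As a quick sanity check on the model itself (independent of the UI), the raw predictions can be inspected directly. This is a rough sketch (the example text is made up) showing that textcat_multilabel scores each label independently, so the scores don't have to sum to 1:

import spacy

nlp = spacy.load("model-out/model-best")

# Made-up input text, just for illustration
doc = nlp("The physics department's football team won again.")

# Each label gets an independent score between 0 and 1
print(doc.cats)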

However, when I check the annotation interface, I see this:

There are a few interesting things here that do suggest a bug.

  1. The model output suggests that two classes should be selected here, but unfortunately I can only select one. That's a problem.
  2. When I was annotating, I may not have selected any examples for sports. That might explain why it doesn't show up, but the label I'm passing on the command line seems to be ignored here. That's also a problem. (A quick way to check what actually got saved is sketched below.)
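To verify point 2, the saved annotations can be pulled straight out of the Prodigy database. A rough sketch, assuming the default database setup; for tasks created with the choice interface, the selected labels end up in the "accept" list:

from prodigy.components.db import connect

# Connect to the default Prodigy database (SQLite unless configured otherwise)
db = connect()
examples = db.get_dataset("issue-6374")

# Print the selected labels and the answer for each saved task
for eg in examples:
    print(eg.get("accept"), eg.get("answer"))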

Short-term fix

Just to confirm, is this the issue that you're experiencing? If so, I might have a short-term fix. You can create a prodigy.json file in your working folder with the following setting:

{
    "choice_style": "multiple"
}

This updates the choice interface, per the docs here, and makes it look like this:

This doesn't fix problem #2, but it does address problem #1. Does this suffice for now? Let me know if there are other issues. I will pick this up with the team, since we may have found some bugs here.

Thanks for reporting!

This is an improvement over the radio interface. However, as soon as I select an option, that choice is saved and I'm sent to the next annotation task. This is sufficient for now, since there are at most ~3 labels per annotation and I can go back to a task and update it if I need to add another category. Thank you!

That's ... curious.

That "next annotation task" behavior might be related to the same issue and might also have a similar fix. Once again, via the prodigy.json settings file.

{
    "choice_style": "multiple",
    "choice_auto_accept": false
}

Could you let me know if this helps?

Yup, thanks for your help!
