I’m trying to understand the config file created by Prodigy when training textcat on top of the Japanese transformer model. I’m training with the following command:
prodigy train my_model --textcat-multilabel my_dataset --base-model ja_core_news_trf
Looking at the config output from the above command, I see a textcat component, but I don’t see any listeners that would take features from the transformer. Am I correct in thinking that this config results in a simple bag-of-words classifier that doesn’t make use of the transformer layer?
The full config file:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "ja"
pipeline = ["transformer","morphologizer","parser","attribute_ruler","ner","textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 64
vectors = {"@vectors":"spacy.Vectors.v1"}
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = null
[components]
[components.attribute_ruler]
source = "ja_core_news_trf"
[components.morphologizer]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.ner]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.parser]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
length = 262144
ngram_size = 1
no_output_layer = false
nO = null
[components.transformer]
source = "ja_core_news_trf"
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 5000
max_epochs = 0
max_steps = 20000
eval_frequency = 1000
frozen_components = ["morphologizer","parser","attribute_ruler","ner"]
before_to_disk = null
annotating_components = []
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
size = 2000
tolerance = 0.2
get_length = null
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
pos_acc = null
morph_micro_f = null
morph_per_feat = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
ents_f = null
ents_p = null
ents_r = null
ents_per_type = null
morph_acc = 0.11
speed = 0.0
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "ja_core_news_trf"
vocab = "ja_core_news_trf"
[initialize.components]
[initialize.components.morphologizer]
[initialize.components.morphologizer.labels]
@readers = "spacy.read_labels.v1"
path = "negative_teach_biased_spacy/labels/morphologizer.json"
[initialize.components.ner]
[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "negative_teach_biased_spacy/labels/ner.json"
[initialize.components.parser]
[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "negative_teach_biased_spacy/labels/parser.json"
[initialize.components.textcat_multilabel]
[initialize.components.textcat_multilabel.labels]
@readers = "spacy.read_labels.v1"
path = "negative_teach_biased_spacy/labels/textcat_multilabel.json"
[initialize.tokenizer]
magdaaniol (Magda Aniol) replied on November 4, 2025:
Hi @jhandsel,
You’re completely right: the reason your config ends up with a simple bag-of-words textcat component (spacy.TextCatBOW.v3) instead of a transformer-aware one is how Prodigy decides whether to optimize for efficiency or accuracy.
Japanese transformer models like ja_core_news_trf don’t include pretrained word vectors, so Prodigy mistakenly generates the “efficiency” setup (the older TextCatBOW model) instead of the “accuracy” setup that connects to the transformer’s embedding layer via an architecture such as spacy.TextCatEnsemble.v2.
The code responsible for this choice was originally written for older pipelines (e.g. the English en_core_web_lg) that bundled pretrained word vectors, and that assumption simply doesn’t hold for the newer transformer pipelines like the Japanese one.
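You can confirm this directly from the generated config: the textcat model block has no tok2vec or listener sublayer at all. A minimal sketch, assuming you’ve saved the config above as generated_config.cfg (the filename is hypothetical):

from thinc.api import Config

# Load the config Prodigy generated (filename is an assumption)
config = Config().from_disk("generated_config.cfg")
model = config["components"]["textcat_multilabel"]["model"]

print(model["@architectures"])  # spacy.TextCatBOW.v3
print("tok2vec" in model)       # False: no listener, so no transformer features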
Until the Prodigy logic is updated, you have two options:
1. Pass a corrected config to the prodigy train recipe manually:
prodigy train my_model --textcat-multilabel my_dataset --base-model ja_core_news_trf --config correct_transformer_config.cfg
2. Export your annotations from Prodigy to spaCy’s DocBin format with the data-to-spacy command and train directly with spacy train (see the sketch after this list).
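For option 2, the workflow would look roughly like this (the directory names and eval split are examples, not requirements):

prodigy data-to-spacy ./corpus --textcat-multilabel my_dataset --base-model ja_core_news_trf --eval-split 0.2
python -m spacy train correct_transformer_config.cfg --output ./my_model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0

Note that when training with spacy train, the config’s [corpora] block should use the spacy.Corpus.v1 readers from your original config (pointing at the exported .spacy files) rather than the Prodigy dataset readers shown in the config below.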
An example of a corrected config for training with Prodigy, using the spacy.TextCatEnsemble.v2 architecture that listens to the shared transformer embedding layer:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "ja"
pipeline = ["transformer","morphologizer","parser","attribute_ruler","ner","textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 128
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = null
[components]
[components.attribute_ruler]
source = "ja_core_news_trf"
[components.morphologizer]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.ner]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.parser]
source = "ja_core_news_trf"
replace_listeners = ["model.tok2vec"]
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v2"}
threshold = 0.5
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
length = 262144
ngram_size = 1
no_output_layer = false
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
source = "ja_core_news_trf"
[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
[corpora.textcat_multilabel]
@readers = "prodigy.TextCatCorpus.v1"
datasets = ["my_dataset"]
eval_datasets = []
exclusive = false
[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 5000
max_epochs = 0
max_steps = 20000
eval_frequency = 1000
frozen_components = ["morphologizer","parser","attribute_ruler","ner"]
before_to_disk = null
annotating_components = []
before_update = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
size = 2000
tolerance = 0.2
get_length = null
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
pos_acc = null
morph_micro_f = null
morph_per_feat = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
ents_f = null
ents_p = null
ents_r = null
ents_per_type = null
morph_acc = 0.11
speed = 0.0
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "ja_core_news_trf"
vocab = "ja_core_news_trf"
[initialize.components]
[initialize.tokenizer]
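Once training finishes, you can double-check that the textcat component is actually wired to the transformer. A minimal sketch, assuming the trained pipeline was saved under my_model/model-best:

import spacy

nlp = spacy.load("my_model/model-best")

# The resolved config records the architecture of each component
model_cfg = nlp.config["components"]["textcat_multilabel"]["model"]
print(model_cfg["@architectures"])             # expect: spacy.TextCatEnsemble.v2
print(model_cfg["tok2vec"]["@architectures"])  # expect: spacy-transformers.TransformerListener.v1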