Default model for textcat-multilabel

VeronicaCPerez · October 12, 2021, 3:58pm

Hi,

In the spaCy configuration on this page, the default model for textcat is said to be TextCatEnsemble. However, when I run Prodigy without specifying my model, the resulting configuration file says I'm using spacy.TextCatBOW.v2.

When I make my own configuration file, adding TextCatEnsemble.v2 as explained on the spaCy website, I get the following warning:

/opt/homebrew/lib/python3.9/site-packages/thinc/layers/layernorm.py:32: RuntimeWarning: divide by zero encountered in reciprocal
  d_xhat = N * dY - sum_dy - dist * var ** (-1.0) * sum_dy_dist

This is the configuration file I'm using. It is the same as the one Prodigy created when I used the default; I just changed the model for textcat_multilabel:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","attribute_ruler","lemmatizer","ner","textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.parser.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true

[components.parser.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.tagger.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = true

[components.tagger.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.textcat_multilabel]
factory = "textcat_multilabel"
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false


[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.5
sample_size = 1.0
ner = null
textcat = null
parser = null
tagger = null
senter = null
spancat = null

[corpora.textcat_multilabel]
@readers = "prodigy.TextCatCorpus.v1"
datasets = ["sectoral_annotations"]
eval_datasets = []
exclusive = false

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.15
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 0
eval_frequency = 1000
frozen_components = ["tagger","parser","attribute_ruler","lemmatizer","ner"]
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "prodigy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
lemma_acc = null
ents_f = null
ents_p = null
ents_r = null
ents_per_type = null
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

ines · October 15, 2021, 1:35pm

Hi! I just had a look and it seems like you're seeing the same inconsistency pointed out here:

github.com/explosion/spaCy

Training Quickstart: Pipeline Differences for textcat exclusive categories

opened 04:58PM - 08 Oct 21 UTC

pmbaumgartner

docs feat / textcat

I was taking a look at the Training docs and using the Config widget with only a `textcat` component and noticed something: when I select "exclusive categories" for a textcat component, when that value is false, the pipeline does not contain a `tok2vec` component. However, when that's true, it does.

> exclusive categories=false

```
pipeline = ["textcat_multilabel"]
```

> exclusive categories=true

```
pipeline = ["tok2vec","textcat"]
```

Is this right, or does the multilabel config also need a `tok2vec` component? FWIW, the `tok2vec` component is still defined later in the config file.

## Which page or section is this issue related to?

https://spacy.io/usage/training#quickstart

This just comes down to the differences between the config auto-generation template and the defaults specified within the library. This is slightly confusing, so we'll be making the template consistent.
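If it helps to see which architecture each config actually ends up with, here is a minimal sketch that lists the `@architectures` entries for one component. It uses only Python's standard `configparser` rather than spaCy's own config loader, and the `architectures` helper and the two inline fragments are made up for illustration: `CUSTOM` mirrors the textcat_multilabel block posted above, while `GENERATED` is an assumed stand-in for the auto-generated file, not a copy of it.

```python
import configparser

# Assumed stand-in for the auto-generated config (only the architecture
# name comes from the thread; the rest is illustrative).
GENERATED = """
[components.textcat_multilabel]
factory = "textcat_multilabel"

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2"
"""

# Mirrors the textcat_multilabel block from the config posted above.
CUSTOM = """
[components.textcat_multilabel]
factory = "textcat_multilabel"

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
"""


def architectures(config_text, component):
    """Map each section of `component` to its @architectures value."""
    # Interpolation is switched off so values like ${system:seed} are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(config_text)
    prefix = f"components.{component}"
    found = {}
    for section in parser.sections():
        if section == prefix or section.startswith(prefix + "."):
            if "@architectures" in parser[section]:
                # Values are quoted in the config file; drop the quotes.
                found[section] = parser[section]["@architectures"].strip('"')
    return found


print(architectures(GENERATED, "textcat_multilabel"))
print(architectures(CUSTOM, "textcat_multilabel"))
```

Running the same helper over the full text of both config files should make the template-versus-library difference visible at a glance. For anything beyond a quick check, loading the file with spaCy's own config utilities is likely the safer route, since they resolve variables and defaults.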
