Base model without tok2vec throws error

Hello!

I've had an issue since version v1.11.12. As stated in the docs, a bug around --base-model usage was fixed in that release. However, when I try to use a base model for NER on a simple dataset (I'm using fr_dep_news_trf), Prodigy returns the following error:

KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']"

Which is... normal, actually, since the fr_dep_news_trf model does not have any tok2vec component! It seems like the prodigy train recipe assumes that any model used for training has a tok2vec component, even though newer spacy-transformers pipelines use a transformer component directly instead of a tok2vec one.
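For what it's worth, a quick check in Python (assuming the fr_dep_news_trf package is installed) confirms that the pipeline exposes a transformer component but no tok2vec:

import spacy

# List the components of the pretrained French transformer pipeline.
nlp = spacy.load("fr_dep_news_trf")
print(nlp.pipe_names)
# ['transformer', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer']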

Here is the command I use, for reference:

prodigy train ./outputs/fr_model/ --ner my_dataset --base-model fr_dep_news_trf --gpu-id 0

my_dataset contains only 'ORG' annotations for a simple NER model, as that is what I need to identify.

Am I wrong? Is there a way to bypass this issue, or should I write a custom training recipe to suit my needs?

Kind regards,
Martin

We're aware of that error. It's a separate issue with transformers, unrelated to the bug fixed in v1.11.12, and it sits further upstream, caused by an issue in the spaCy codebase. The team is aware of it and is currently working on a fix.

Another thread on this issue can be found here:

That thread also has a temporary workaround that involves writing a custom config.cfg file. It's not a proper solution, but might serve as a remedy for now.
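For reference, such a custom config is passed to prodigy train with the --config flag (as shown further down in this thread); a rough sketch of the invocation, where my_dataset and custom_config.cfg are just placeholder names:

prodigy train ./output --ner my_dataset --config custom_config.cfg --gpu-id 0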

Thank you for the quick response! I tried to find previous threads but didn't find that one. I will look into the temporary solution then, and look forward to the team fixing this. I guess I can delete this topic then?

It can't hurt to leave it open for now, since a link to the other thread exists.

Now that I think of it, it might even be better to leave the topic open because, as you say, you weren't able to find the thread yourself. If we keep this open, maybe Google/Discourse will have an easier time indexing more appropriate keywords that eventually lead to the right thread.

Should this be fixed in 1.13? Is there a generic explanation for the config file workaround you did in the other thread?

I'm trying to train with the en_core_web_trf base model and run into

[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

with prodigy train ./models --ner dataset --base-model en_core_web_trf

Dear koaning, thank you for providing the link.
I want to use en_core_web_trf and train a spancat component. I created a config file in which I deleted all the lines that mentioned "tok2vec". However, I still get the error

"[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"

after running the following command:

python -m prodigy train ./model --spancat trans_span_labeled_dataset --config filled_modified_config.cfg --base-model en_core_web_trf

Here is my config file:

[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

I would greatly appreciate it if you could give me a hint on how to resolve this problem.
Thank you!

Hello DariaS,

What I ended up doing after some investigation was to remove the transformer component from the pipeline entirely and instead declare the transformer as the tok2vec layer inside my ner component. Here is my config file for reference:

[paths]
train = "./dvcstore/data/train.spacy"
dev = "./dvcstore/data/dev.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "fr"
pipeline = ["ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 64
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = "camembert-base"
grad_factor = 1.0
mixed_precision = false
pooling = {"@layers":"reduce_mean.v1"}

[components.ner.model.tok2vec.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner.model.tok2vec.grad_scaler_config]

[components.ner.model.tok2vec.tokenizer_config]
use_fast = false

[components.ner.model.tok2vec.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 2000
max_epochs = 0
max_steps = 20000
eval_frequency = 500
frozen_components = []
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = false
get_length = null
size = 2000
buffer = 256

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.9992
L2_is_weight_decay = true
L2 = 0.001
grad_clip = 1.0
use_averages = true
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
pos_acc = null
morph_acc = null
morph_per_feat = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
lemma_acc = null
speed = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
vocab_data = null
lookups = null
before_init = null
after_init = null
init_tok2vec = ${paths.init_tok2vec}

[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "spacy_training/labels/ner.json"
require = false

[initialize.tokenizer]

What's really worth noting is that I used only pipeline = ["ner"] and nothing else, and also that the transformer is declared directly as the ner component's tok2vec layer:


[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = "camembert-base"
grad_factor = 1.0
mixed_precision = false
pooling = {"@layers":"reduce_mean.v1"}

[components.ner.model.tok2vec.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner.model.tok2vec.grad_scaler_config]

[components.ner.model.tok2vec.tokenizer_config]
use_fast = false

[components.ner.model.tok2vec.transformer_config]

I even left the grad_factor at 1.0, given that I had enough examples to fine-tune my model... But I guess this is mostly spaCy-fu and not especially related to Prodigy!

Dear Martin,
thank you so much for your time and your answer.
I tried to use your config file, but I still keep getting the same error

KeyError: "[E001] No component 'tok2vec' found in pipeline. Available names: ['transformer', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"

after running the command

python -m prodigy train ./model --spancat trans_span_labeled_dataset --config filled_config_changed.cfg --base-model en_core_web_trf

This might be because you are still using the --base-model parameter, while I don't! I should have mentioned this: the only place a model is specified in my case is here:

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = "camembert-base" # this here is the huggingface model to use
grad_factor = 1.0
mixed_precision = false
pooling = {"@layers":"reduce_mean.v1"}

inside the configuration file passed with --config. I am not an expert, but I think you can specify any Hugging Face transformer model you have at your disposal there. I do not use a base model anymore, and that is what solved my problem.
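For reference, the command I run no longer includes --base-model at all; roughly something like this, where the output path, dataset name and config filename are just placeholders:

python -m prodigy train ./outputs/fr_model --ner my_dataset --config config.cfg --gpu-id 0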

Has this been fixed yet? The original bug is a year old.

Hi @James,

Thanks for raising the issue and I certainly understand your concern.
We have started to work on it, but for a number of reasons this fix got deprioritized. I'll make sure it's back on the agenda for the forthcoming sprint.
Again, apologies for the delay in addressing it.

Hi everyone,

I'd like to share an update: the bug that prevented the use of transformer spaCy pipelines as base models has been fixed in the recently released Prodigy 1.15.1 (changelog).
Again, apologies for the delay in addressing this issue.