Question about the configuration file when using the en_core_sci_scibert model for NER

Dear Prodigy team:

I'm training a custom NER model using the en_core_sci_scibert base model to identify a new entity label not included in the original model. I have already annotated my training data using ner.manual, relying on en_core_sci_scibert for tokenization. To train the new NER model, I removed the original NER component using nlp.remove_pipe("ner") in Python. Then, I tried two different approaches:

Approach 1: Remove NER and save model

  • After removing the NER component, I saved the model as en_core_sci_scibert_without_ner. Here, the saved model directory does not contain a ner folder, and there is no ner component in the pipeline.
  • In the automatically generated config.cfg file for en_core_sci_scibert_without_ner model:

pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = []

If I train the model using the config.cfg file with the above parameters, I get the following error message:

ValueError: [E203] If the tok2vec embedding layer is not updated during training, make sure to include it in 'annotating components'.

Approach 2: Add blank NER and save model

  • After removing the NER component, I added a blank NER component via nlp.add_pipe("ner", last=True) and saved the model as en_core_sci_scibert_empty_ner. Here, the saved model directory does contain a ner folder, and there is a ner component in the pipeline.
  • In the automatically generated config.cfg file for the en_core_sci_scibert_empty_ner model:

pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]

If I train the model using the above parameters, I get the following error message:

KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."

My questions:

  1. Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?
  2. In either case, should I add "ner" to annotating_components while keeping the other components frozen as listed? (I tried different combinations of the pipeline, frozen_components, and annotating_components settings, but I encountered a different error message each time.)
  3. Can you help explain and resolve each error?
  • Especially the [E022] error: "Could not find a transition with the name 'O'."

Personally, I think adding a blank NER component makes more sense—without it, there's no ner folder in the model, which leads to:

FileNotFoundError: [Errno 2] No such file or directory: 'en_core_sci_scibert_without_ner/ner/moves'

I apologize for the verbose wording, and I sincerely appreciate any suggestions and help you can offer!

Hi @Fangjian!

No need to apologize - your question is very well structured and we can clearly see what you've already tried!

  1. Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?

Your overall approach of substituting the NER component in a pre-trained pipeline makes total sense. It's a very common use of transfer learning: leveraging the embeddings, and potentially other linguistic features, of a pre-trained pipeline to train a custom component.

In fact, the spaCy projects repository contains an example that does just that. You should be able to adapt it to your case, so I highly recommend it as a reference.

Between the two approaches you mentioned, approach no. 2, i.e. substituting the pre-trained NER component with a blank NER component, is the correct one. You are right in thinking that the NER component must be added to the pipeline and correctly initialized before training.
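For reference, the replacement step you described could look roughly like this in Python (a minimal sketch; "NEW_ENTITY" is just a placeholder for your custom label):

import spacy

nlp = spacy.load("en_core_sci_scibert")
nlp.remove_pipe("ner")                        # drop the pre-trained NER component
ner = nlp.add_pipe("ner", last=True)          # add a fresh, blank NER component
ner.add_label("NEW_ENTITY")                   # placeholder label; spacy train can also pick labels up from the corpus
nlp.to_disk("en_core_sci_scibert_empty_ner")  # save the base pipeline for training

The add_label step is optional here, since label initialization is normally handled during training (more on that below).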

  2. In either case, should I add "ner" to annotating_components while keeping the other components frozen as listed? (I tried different combinations of the pipeline, frozen_components, and annotating_components settings, but I encountered a different error message each time.)

Since what you want is to train the NER component only, you should freeze all the remaining components.
It's not required to add "ner" to annotating_components, because no other component depends on its annotations during training or inference.
The [E203] error is raised because, if a frozen tok2vec or transformer layer is not listed in annotating_components, its annotations are never set during training, so the components that depend on it cannot use them. Adding the embedding layer to annotating_components might resolve it, but it shouldn't be necessary if you configure your blank NER component to listen to the frozen embedding layer. Please consult the spaCy NER demo I linked above for the correct training config template, and also the spaCy docs on annotating components.
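For illustration, if your new ner component did share the frozen transformer via a listener, the relevant part of the [training] block might look roughly like this (just a sketch of that one scenario, not a full config):

[training]
frozen_components = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
# listing the frozen embedding component here means it still sets its
# annotations on the Doc during training, so listeners can use them (avoids E203)
annotating_components = ["transformer"]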

Especially the [E022] error: "Could not find a transition with the name 'O'."

This happens when you add a blank NER component but don't provide any labels. That means the model doesn't know which entities to recognize, not even the "O" label for "outside any entity". Label initialization is normally automated by the spacy train command, so I'd need to see your full config, the training command, and how you exported the annotations from Prodigy.
Also, I really recommend using the spaCy NER demo project as guidance, as you should be able to configure the NER substitution and the correct initialization via the config.
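If the automatic initialization doesn't pick up your labels, you can also make them explicit in the [initialize] block. A sketch of what that might look like, assuming a labels file generated with python -m spacy init labels (the corpus/labels/ner.json path is just a placeholder):

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"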

I appreciate your prompt and detailed response.

I downloaded the config file from the ner_demo_replace example project and compared it with my own. Based on my observations, I couldn’t find anything obviously wrong. However, I did notice that the example project uses the en_core_web_sm model, which is tok2vec-based, while I am using en_core_sci_scibert, which is transformer-based. This leads to differences in model architecture such as how the tagger and parser components are configured in the [components] section.

I suspect this architectural difference is what causes the error: [E022] Could not find a transition with the name 'O' in the NER model.

I would appreciate it if you could help me identify what is wrong with my current configuration file:

[paths]
vectors = null
init_tok2vec = null
parser_tagger_path = "output/en_core_sci_scibert_parser_tagger/model-best"
vocab_path = null
train = null
dev = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "allenai/scibert_scivocab_uncased"
mixed_precision = true

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "med_mentions_reader"
directory_path = "assets/"
split = "dev"

[corpora.train]
@readers = "med_mentions_reader"
directory_path = "assets/"
split = "train"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 7
max_steps = 0
eval_frequency = 500
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 1
stop = 32
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = null
lemma_acc = 0.5
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
ents_f = 0.5
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null
before_init = {"@callbacks":"replace_tokenizer"}
after_init = null

[initialize.components]

[initialize.tokenizer]

Hi @Fangjian,

That's right, the example uses a different pre-trained spaCy pipeline, but the mechanism of setting up listeners and copying the tokenizer and vocab settings is pretty much the same.
If you have en_core_sci_scibert installed in your environment, you could even set the demo project to use that as the base pipeline (line 7 in project.yml).

Regarding your config, you should be sourcing all the components except for ner from en_core_sci_scibert - currently you're initializing them from scratch.
So instead of:

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

it should be:

[components.attribute_ruler]
source = "en_core_sci_scibert"

and so on for all the remaining components except for NER.
As for copying the tokenizer and vocab, I can see you're using a custom callback. The default spaCy callback for this is:

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_scibert"
vocab = "en_core_sci_scibert"

I can also see you're using custom readers and a number of custom training parameters. I suppose these are optimized for training on the original en_core_sci_scibert dataset. If you're going to train the custom NER component with spaCy on the dataset annotated with Prodigy, you could start with the spaCy defaults for training transformer pipelines, which would be this:

[paths]
vectors = null
init_tok2vec = null
parser_tagger_path = "output/en_core_sci_scibert_parser_tagger/model-best"
vocab_path = null
train = null
dev = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
source = "en_core_sci_scibert"

[components.lemmatizer]
source = "en_core_sci_scibert"

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.parser]
source = "en_core_sci_scibert"

[components.tagger]
source = "en_core_sci_scibert"

[components.transformer]
source = "en_core_sci_scibert"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_scibert"
vocab = "en_core_sci_scibert"

[initialize.components]

[initialize.tokenizer]
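With a config along these lines, you'd fill in the corpus paths and run training as usual. Roughly (the dataset name, output directory and GPU id below are placeholders):

python -m prodigy data-to-spacy ./corpus --ner your_ner_dataset --eval-split 0.2
python -m spacy train config.cfg --output ./output --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0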