Question about the configuration file when using the en_core_sci_scibert model for NER

Dear Prodigy team:

I'm training a custom NER model using the en_core_sci_scibert base model to identify a new entity label not included in the original model. I have already annotated my training data using ner.manual, relying on en_core_sci_scibert for tokenization. To train the new NER model, I removed the original NER component using nlp.remove_pipe("ner") in Python. Then, I tried two different approaches:

Approach 1: Remove NER and save model

  • After removing the NER component, I saved the model as en_core_sci_scibert_without_ner. Here, the model directory does not contain a ner folder, and there is no ner component in the pipeline.
  • In the automatically generated config.cfg file for en_core_sci_scibert_without_ner model:

pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]

If I train the model using the config.cfg file with the above parameters, I get the error message:

ValueError: [E203] If the tok2vec embedding layer is not updated during training, make sure to include it in 'annotating components'.

Approach 2: Add blank NER and save model

  • After removing the NER component, I added a blank NER component via nlp.add_pipe("ner", last=True) and saved the model as en_core_sci_scibert_empty_ner. Here, I have a ner folder in the model directory and a ner component in the pipeline.
  • In the automatically generated config.cfg file for en_core_sci_scibert_empty_ner model:

pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]

If I train the model using the above parameters, I got the error message:

KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."

My questions:

  1. Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?
  2. In either case, should I add "ner" to annotating_components while keeping the other components frozen as listed? (I tried different combinations of the pipeline, frozen_components and annotating_components arguments, but I encountered different error messages each time.)
  3. Can you help explain and resolve each error?
  • Especially the [E022] error: Could not find a transition with the name 'O'.

Personally, I think adding a blank NER component makes more sense—without it, there's no ner folder in the model, which leads to:

FileNotFoundError: [Errno 2] No such file or directory: 'en_core_sci_scibert_without_ner/ner/moves'

I apologize for the verbose wording, and I sincerely appreciate any suggestions and help you can offer!

Hi @Fangjian!

No need to apologize - your question is very well structured and we can clearly see what you've already tried!

  1. Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?

Your overall approach of substituting the NER component in a pre-trained pipeline makes total sense. It's a very common use of transfer learning: you leverage the embeddings, and potentially other linguistic features, to train a custom component.

In fact, the spaCy projects repository has an example that does just that. You should be able to adapt it to your case, so I highly recommend it as a reference.

Between the two approaches you mentioned, approach no. 2, i.e. substituting the pre-trained NER component with a blank NER component, is the correct one. You are right in thinking that the NER component must be added to the pipeline and correctly initialized before training.

  2. In either case, should I add "ner" to annotating_components while keeping the other components frozen as listed? (I tried different combinations of the pipeline, frozen_components and annotating_components arguments, but I encountered different error messages each time.)

Since what you want is to train the NER component only, you should freeze all the remaining components.
It's not required to add NER to annotating components because there's no other component that depends on its annotations during training/inference.
The [E203] error occurs because a frozen tok2vec layer that is not listed in annotating_components cannot be used by the remaining pipeline components. Adding the tok2vec layer to annotating_components might resolve this, but it shouldn't be necessary if you configure your blank NER component to listen to the frozen embedding layer. Please consult the spaCy NER demo I linked above for the correct training config template, and see the spaCy docs on annotating components.
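
As a quick sanity check before training, you can read the split between frozen and trainable components straight from config.cfg. Below is a minimal sketch using only Python's standard library. The function name summarize_training_setup and the inline sample config are made up for illustration, and it only reports what the config says; it does not replicate spaCy's own validation.

```python
import configparser
import json

# Hypothetical, shortened config using the settings quoted in this thread.
SAMPLE_CONFIG = """
[nlp]
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]

[training]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]
"""


def summarize_training_setup(config_text):
    """Report which pipeline components are trainable, frozen, or annotating."""
    # Only "=" is treated as a delimiter so JSON values containing ":" stay intact.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key casing such as "nO" or "L2"
    parser.read_string(config_text)

    # Config values are JSON-compatible, so lists can be parsed with json.loads.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = json.loads(parser["training"]["frozen_components"])
    annotating = json.loads(parser["training"]["annotating_components"])

    return {
        "trainable": [name for name in pipeline if name not in frozen],
        "frozen": [name for name in pipeline if name in frozen],
        "annotating": [name for name in pipeline if name in annotating],
    }


summary = summarize_training_setup(SAMPLE_CONFIG)
for key, names in summary.items():
    print(f"{key}: {names}")
```

If "trainable" lists anything other than ner, or is empty, the config is not set up to train only the NER component.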

Especially the [E022] error: Could not find a transition with the name 'O'.

This happens when you add a blank NER but don’t provide any labels. That means the model doesn’t know what entities to recognize, not even the “O” label for "outside any entity". This is normally automated by the spacy train command, so I'd need to see your full config, the training command, and how you exported the annotations from Prodigy.
Also, I really recommend using the spaCy NER demo project as guidance, as you should be able to configure the NER substitution and correct initialization via the config.
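
If you do end up building the pipeline in Python, the labels can be added to the blank component before saving. Here is a minimal sketch, assuming spaCy v3 is installed. It uses a blank English pipeline as a stand-in for en_core_sci_scibert so that it runs without the model, and FOO is only a placeholder label.

```python
import spacy

# Assumption: spacy.blank("en") stands in for the real base pipeline here.
# With the real model you would instead do:
#   nlp = spacy.load("en_core_sci_scibert")
#   nlp.remove_pipe("ner")
nlp = spacy.blank("en")

# Add a fresh, untrained NER component at the end of the pipeline.
ner = nlp.add_pipe("ner", last=True)

# Register every label that occurs in your annotations (FOO is a placeholder).
ner.add_label("FOO")

print(nlp.pipe_names)
print(ner.labels)
```

Whether this alone removes [E022] depends on how training is launched, so treat it as something to try rather than a guaranteed fix.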

I appreciate your prompt and detailed response.

I downloaded the config file from the ner_demo_replace example project and compared it with my own. Based on my observations, I couldn’t find anything obviously wrong. However, I did notice that the example project uses the en_core_web_sm model, which is tok2vec-based, while I am using en_core_sci_scibert, which is transformer-based. This leads to differences in model architecture such as how the tagger and parser components are configured in the [components] section.

I suspect this architectural difference is what causes the error: [E022] Could not find a transition with the name 'O' in the NER model.

I would appreciate it if you could help me identify what is wrong with my current configuration file:

[paths]
vectors = null
init_tok2vec = null
parser_tagger_path = "output/en_core_sci_scibert_parser_tagger/model-best"
vocab_path = null
train = null
dev = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "allenai/scibert_scivocab_uncased"
mixed_precision = true

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "med_mentions_reader"
directory_path = "assets/"
split = "dev"

[corpora.train]
@readers = "med_mentions_reader"
directory_path = "assets/"
split = "train"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 0
max_epochs = 7
max_steps = 0
eval_frequency = 500
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 1
stop = 32
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = null
lemma_acc = 0.5
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
ents_f = 0.5
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null
before_init = {"@callbacks":"replace_tokenizer"}
after_init = null

[initialize.components]

[initialize.tokenizer]

Hi @Fangjian,

That's right, the example uses a different pre-trained spaCy pipeline but the mechanism of setting listeners, copying tokenizer and vocab settings is pretty much the same.
If you have en_core_sci_scibert installed in your environment, you could even set the demo project to use that as the base pipeline (line 7 in project.yml).

Regarding your config, you should be sourcing all the components except for ner from en_core_sci_scibert; currently you're initializing them from scratch.
So instead of:

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

it should be:

[components.attribute_ruler]
source = "en_core_sci_scibert"

and likewise for all the remaining components except for NER.
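
To double-check which components a config still builds from scratch, you can list how each one is defined. This is a rough standard-library sketch. The function name component_origins and the sample config are hypothetical, and it only looks at the top-level [components.<name>] blocks.

```python
import configparser
import json

# Hypothetical, shortened config with one sourced and one factory component.
SAMPLE_CONFIG = """
[components.attribute_ruler]
source = "en_core_sci_scibert"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
"""


def component_origins(config_text):
    """Map each component name to how the config says it is created."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.optionxform = str  # keep key casing as written in the config
    parser.read_string(config_text)

    origins = {}
    for section in parser.sections():
        parts = section.split(".")
        # Skip nested blocks such as [components.ner.model].
        if len(parts) != 2 or parts[0] != "components":
            continue
        block = parser[section]
        if "source" in block:
            origins[parts[1]] = "sourced from " + json.loads(block["source"])
        elif "factory" in block:
            origins[parts[1]] = "created from factory"
    return origins


origins = component_origins(SAMPLE_CONFIG)
for name, origin in origins.items():
    print(f"{name}: {origin}")
```

For the setup discussed here, only ner should show up as created from a factory.
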
As for copying the tokenizer and vocab I can see you're using a custom callback. The default spaCy callback for this is:

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_scibert"
vocab = "en_core_sci_scibert"

I can also see you're using custom readers and a number of custom training parameters. I suppose these were optimized for training on the original en_core_sci_scibert dataset. If you're going to train the custom NER component with spaCy on the dataset annotated with Prodigy, you could start with the spaCy defaults for training transformer pipelines, which would be:

[paths]
vectors = null
init_tok2vec = null
parser_tagger_path = "output/en_core_sci_scibert_parser_tagger/model-best"
vocab_path = null
train = null
dev = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
source = "en_core_sci_scibert"

[components.lemmatizer]
source = "en_core_sci_scibert"

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true

[components.parser]
source = "en_core_sci_scibert"

[components.tagger]
source = "en_core_sci_scibert"

[components.transformer]
source = "en_core_sci_scibert"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = ${paths.vocab_path}
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_sci_scibert"
vocab = "en_core_sci_scibert"

[initialize.components]

[initialize.tokenizer]

Thank you so much for your prompt response.

I updated the tagger and parser components to source from en_core_sci_scibert and added the initialize.before_init callback as recommended. I also replaced @architectures = "spacy.TransitionBasedParser.v2" with @architectures = "spacy-transformers.TransformerBasedNER.v1" to ensure the NER model uses the transformer-based architecture. However, I'm still encountering the same error:
[E022] Could not find a transition with the name 'O' in the NER model. Despite making multiple changes, the issue persists, and I’m not sure what’s causing it.

I’m unsure what's causing this, especially since there aren’t many examples using en_core_sci_scibert — most people seem to use en_core_web_trf for transformer-based workflows.

On a related note: I used en_core_sci_scibert as my base model when running ner.manual, but when I reviewed the tokenized text afterward, it didn’t appear to be tokenized by a transformer model. For example, when I previously used bert.ner.manual, I could clearly see special tokens like [CLS] and [SEP], indicating BERT-style tokenization.

This makes me wonder: how does tokenization actually work for BERT-based models in spaCy and Prodigy? When using en_core_sci_scibert, it feels more like a wrapper around a traditional spaCy model such as en_core_web_sm, at least during the ner.manual stage. I understand that Prodigy handles alignment automatically and that tokenization can be handled in post-processing, but I would appreciate some clarification on how tokenization works under the hood in this setup.

Thank you so much for your further assistance!

Hi @Fangjian,

BERT-based models in spaCy use an alignment system between spaCy's linguistic tokenization and BERT's wordpiece tokenization. The spacy-transformers library calculates an alignment to spaCy's linguistic tokenization, so you can relate the transformer features back to actual words, instead of just wordpieces. You can find out more about how exactly it's being done in this blog post.

Prodigy uses tokenization provided by spaCy, so if you used en_core_sci_scibert as your base model in ner.manual you'd effectively be getting en_core_sci_scibert tokenization.
It's true that until Prodigy 1.15.1 the sourcing of the tokenizer was not automated (you had to provide a config file with copying instructions), but if your Prodigy version is >= 1.15.1, this should work out of the box.

The reason you saw wordpiece tokens such as [CLS] with bert.ner.manual is that this recipe does not use spaCy for tokenization. It uses the HuggingFace tokenizers library. If you use a spaCy pipeline, you don't have to worry about the alignment, as it is taken care of under the hood.
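
To make the idea of alignment more concrete, here is a toy sketch in plain Python. It is not how spacy-transformers implements alignment. The function name align_wordpieces is hypothetical, the wordpiece split is made up rather than produced by the real SciBERT vocabulary, and an uncased vocabulary is assumed.

```python
def align_wordpieces(words, pieces):
    """Map each word to the indices of the wordpieces that spell it out."""
    alignment = []
    position = 0
    for word in words:
        target = word.lower()  # assumption: uncased wordpiece vocabulary
        spelled = ""
        indices = []
        while spelled != target:
            if position >= len(pieces):
                raise ValueError(f"ran out of wordpieces while matching {word!r}")
            piece = pieces[position]
            # "##" marks a piece that continues the previous one.
            spelled += piece[2:] if piece.startswith("##") else piece
            if not target.startswith(spelled):
                raise ValueError(f"wordpiece {piece!r} does not fit {word!r}")
            indices.append(position)
            position += 1
        alignment.append(indices)
    return alignment


# Linguistic tokens as spaCy would expose them, and a made-up wordpiece split.
words = ["SARS-CoV-2", "spike", "protein"]
pieces = ["sar", "##s", "-", "co", "##v", "-", "2", "spike", "protein"]

alignment = align_wordpieces(words, pieces)
for word, indices in zip(words, alignment):
    print(word, "->", [pieces[i] for i in indices])
```

The point is that annotation happens on the left-hand side (words), while the transformer works on the right-hand side (wordpieces), and the mapping between the two is kept for you.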

Now, to make sure your data is tokenized with the en_core_sci_scibert tokenizer when running ner.manual, you can run a simple test.

  1. Tokenize a sentence that is tokenized differently by en_core_sci_scibert and en_core_web_sm, e.g. "SARS-CoV-2 spike protein binds to ACE2 receptors.":
import spacy

def compare_tokens():
    """
    Compare tokens between SciBERT and standard English tokenizer
    """
    
    # Load models
    # assuming both are installed in the local virtual environment
    scibert = spacy.load("en_core_sci_scibert")
    standard = spacy.load("en_core_web_sm")
    
    # Test texts
    test_texts = [
        "The mitochondria is the powerhouse of the cell.",
        "COVID-19 pandemic affected healthcare systems.",
        "SARS-CoV-2 spike protein binds to ACE2 receptors.",
        "CRISPR-Cas9 gene editing technology is revolutionary.",
        "Machine learning algorithms predict protein folding."
    ]
    
    for text in test_texts:
        print(f"\nText: '{text}'")
        print("-" * 50)
        
        scibert_tokens = [token.text for token in scibert(text)]
        standard_tokens = [token.text for token in standard(text)]
        
        print(f"SciBERT:  {scibert_tokens}")
        print(f"Standard: {standard_tokens}")
        
        if scibert_tokens != standard_tokens:
            print("⚠️  Different tokenization!")
        else:
            print("✓ Same tokenization")

compare_tokens()

You'll see that, for example:

Text: 'SARS-CoV-2 spike protein binds to ACE2 receptors.'
--------------------------------------------------
SciBERT:  ['SARS-CoV-2', 'spike', 'protein', 'binds', 'to', 'ACE2', 'receptors', '.']
Standard: ['SARS', '-', 'CoV-2', 'spike', 'protein', 'binds', 'to', 'ACE2', 'receptors', '.']
⚠️  Different tokenization!

If you then use this sentence as input to ner.manual:

python -m prodigy ner.manual test en_core_sci_scibert input.jsonl -l FOO

You'll see that when you try to annotate SARS-CoV-2, the selection snaps to the entire phrase, and if you inspect the annotated dataset, you'll see that SARS-CoV-2 is listed as a single token:

{
  "text": "SARS-CoV-2 spike protein binds to ACE2 receptors",
  "_input_hash": -610406285,
  "_task_hash": -1006421252,
  "_is_binary": false,
  "tokens": [
    {
      "text": "SARS-CoV-2",
      "start": 0,
      "end": 10,
      "id": 0,
      "ws": true
    },
...
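
If you'd rather check this programmatically, you can compare each annotated span against the token boundaries stored in the task. Here is a small sketch using only the standard library. The example task is abbreviated and the FOO span is made up, and the function name tokens_covered_by_span is hypothetical; the tokens and spans keys follow the task format shown above.

```python
def tokens_covered_by_span(task, span):
    """Return the ids of tokens inside the span; raise if it cuts through a token."""
    starts = {token["start"] for token in task["tokens"]}
    ends = {token["end"] for token in task["tokens"]}
    if span["start"] not in starts or span["end"] not in ends:
        raise ValueError("span does not align with token boundaries")
    return [
        token["id"]
        for token in task["tokens"]
        if span["start"] <= token["start"] and token["end"] <= span["end"]
    ]


# Abbreviated task: only the first two tokens and one made-up span.
task = {
    "text": "SARS-CoV-2 spike protein binds to ACE2 receptors",
    "tokens": [
        {"text": "SARS-CoV-2", "start": 0, "end": 10, "id": 0, "ws": True},
        {"text": "spike", "start": 11, "end": 16, "id": 1, "ws": True},
    ],
    "spans": [{"start": 0, "end": 10, "label": "FOO"}],
}

for span in task["spans"]:
    print(span["label"], tokens_covered_by_span(task, span))
```

A span that covers SARS-CoV-2 as exactly one token id confirms the task was tokenized the way en_core_sci_scibert tokenizes it.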

To sum up, spaCy takes care of the alignment internally and produces linguistic tokens. You can inspect the alignment information by looking at doc._.trf_data.

This also means that data should be ready to train directly from Prodigy. Assuming your training config looks like the one I posted above, you should be able to kick off training by running:

python -m prodigy train output --ner test --config config.cfg

The error you're reporting:

[E022] Could not find a transition with the name 'O' in the NER model.

suggests the NER component does not have any labels. That could happen if you used the manually created pipeline with add_pipe, as you described above. With that procedure, you would also need to add the labels yourself. In any case, you shouldn't really need to manipulate the pipeline programmatically if you configure it via the config file as discussed above. If you're still running into issues, please share the full script and/or commands that you're using to trigger the training.