Dear Prodigy team:
I'm training a custom NER model using the en_core_sci_scibert base model to identify a new entity label not included in the original model. I have already annotated my training data using ner.manual, relying on en_core_sci_scibert for tokenization. To train the new NER model, I removed the original NER component using nlp.remove_pipe("ner") in Python. Then, I tried two different approaches:
Approach 1: Remove NER and save model
- After removing the NER component, I saved the model as `en_core_sci_scibert_without_ner`. Here, the model directory does not contain a `ner` folder, and there is no `ner` component in the pipeline.
- In the automatically generated `config.cfg` file for the `en_core_sci_scibert_without_ner` model:
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]
If I train the model using the `config.cfg` file with the above parameters, I get the error message:
ValueError: [E203] If the tok2vec embedding layer is not updated during training, make sure to include it in 'annotating components'.
Approach 2: Add blank NER and save model
- After removing the NER component, I added a blank NER component via `nlp.add_pipe("ner", last=True)` and saved the model as `en_core_sci_scibert_empty_ner`. Here, I have a `ner` folder in the model directory and a `ner` component in the pipeline.
- In the automatically generated `config.cfg` file for the `en_core_sci_scibert_empty_ner` model:
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = [ ]
If I train the model using the above parameters, I get the error message:
KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."
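To make the difference between the two configurations concrete, here is a small plain-Python sketch that lists which pipeline components are left unfrozen in each case. It only compares the lists quoted above. `trainable_components` is a hypothetical helper written for this post, not a spaCy or Prodigy function, and it makes no claim about how spaCy itself resolves these settings.

```python
def trainable_components(pipeline, frozen_components):
    """Return the pipeline components that are not frozen, in pipeline order.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    frozen = set(frozen_components)
    return [name for name in pipeline if name not in frozen]


# Values copied from the two generated config.cfg files quoted above.
frozen = ["transformer", "parser", "tagger", "attribute_ruler", "lemmatizer"]
without_ner = ["transformer", "tagger", "attribute_ruler", "lemmatizer", "parser"]
empty_ner = without_ner + ["ner"]

print(trainable_components(without_ner, frozen))  # []
print(trainable_components(empty_ner, frozen))    # ['ner']
```

So with the Approach 1 settings every listed component is frozen, while with the Approach 2 settings only `ner` is left unfrozen.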
My questions:
- Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?
- In either case, should I add `"ner"` to `annotating_components` while keeping the other components frozen as listed? (I tried different combinations of the `pipeline`, `frozen_components` and `annotating_components` arguments, but I encountered different error messages each time.)
- Can you help explain and resolve each error?
- Especially the `[E022]` error: `Could not find a transition with the name 'O'`.
Personally, I think adding a blank NER component makes more sense. Without it, there's no `ner` folder in the model, which leads to:
FileNotFoundError: [Errno 2] No such file or directory: 'en_core_sci_scibert_without_ner/ner/moves'
I apologize for the verbose wording, and I sincerely appreciate any suggestions and help you can offer!