Dear Prodigy team:
I'm training a custom NER model using en_core_sci_scibert as the base model, to identify a new entity label that is not included in the original model. I have already annotated my training data using ner.manual, relying on en_core_sci_scibert for tokenization. To train the new NER model, I removed the original NER component in Python using nlp.remove_pipe("ner"). Then I tried two different approaches:
Approach 1: Remove NER and save model
- After removing the NER component, I saved the model as en_core_sci_scibert_without_ner. Here, the model directory does not contain a ner folder, and there is no ner component in the pipeline.
- In the automatically generated config.cfg file for the en_core_sci_scibert_without_ner model:
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = []
If I train the model using the config.cfg file with the above parameters, I get the error message:
ValueError: [E203] If the tok2vec embedding layer is not updated during training, make sure to include it in 'annotating components'.
Approach 2: Add blank NER and save model
- After removing the NER component, I added a blank NER component via nlp.add_pipe("ner", last=True) and saved the model as en_core_sci_scibert_empty_ner. Here, I have a ner folder in the model directory and a ner component in the pipeline.
- In the automatically generated config.cfg file for the en_core_sci_scibert_empty_ner model:
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = []
If I train the model using the above parameters, I get the error message:
KeyError: "[E022] Could not find a transition with the name 'O' in the NER model."
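For reference, the two approaches above can be sketched as a single helper. This is only a minimal sketch of the steps described in this post: the function name save_variant and its add_blank_ner flag are hypothetical, and nlp is assumed to be a pipeline already loaded with spacy.load("en_core_sci_scibert").

```python
from pathlib import Path


def save_variant(nlp, out_dir, add_blank_ner=False):
    """Remove the existing "ner" pipe, optionally add an untrained one, then save.

    Hypothetical helper: `nlp` is assumed to be a loaded spaCy pipeline.
    Returns the component names that were written to disk.
    """
    if "ner" in nlp.pipe_names:
        nlp.remove_pipe("ner")          # both approaches start by dropping the old NER
    if add_blank_ner:
        nlp.add_pipe("ner", last=True)  # Approach 2 only: fresh, untrained NER at the end
    nlp.to_disk(Path(out_dir))
    return list(nlp.pipe_names)


# Usage (requires spaCy and the en_core_sci_scibert model to be installed):
#   import spacy
#   save_variant(spacy.load("en_core_sci_scibert"), "en_core_sci_scibert_without_ner")
#   save_variant(spacy.load("en_core_sci_scibert"), "en_core_sci_scibert_empty_ner",
#                add_blank_ner=True)
```

The model is loaded separately for each variant so that the two saved directories do not share state.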
My questions:
- Does my overall approach make sense? Which method is more appropriate? Is it necessary to add a blank NER component before training?
- In either case, should I add "ner" to annotating_components while keeping the other components frozen as listed? (I tried different combinations of the pipeline, frozen_components and annotating_components settings, but I encountered a different error message each time.)
- Can you help explain and resolve each error?
- Especially the [E022] error: Could not find a transition with the name 'O'.
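To make the configuration question concrete, here is a small sketch that lists which pipeline components are left trainable (that is, not frozen) in each of the two configs quoted above. It uses only the Python standard library rather than spaCy's own config loader, and it assumes the settings live in the [nlp] and [training] sections as in a spaCy v3 config.cfg. The helper name trainable_components is hypothetical.

```python
import json
from configparser import ConfigParser

# Fragments reproducing the settings quoted in this post.
APPROACH_1 = """
[nlp]
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser"]

[training]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = []
"""

APPROACH_2 = """
[nlp]
pipeline = ["transformer","tagger","attribute_ruler","lemmatizer","parser","ner"]

[training]
frozen_components = ["transformer","parser","tagger","attribute_ruler","lemmatizer"]
annotating_components = []
"""


def trainable_components(cfg_text):
    """Return the pipeline components that are not listed as frozen."""
    parser = ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # The list values are written in JSON syntax, so json.loads can read them.
    pipeline = json.loads(parser["nlp"]["pipeline"])
    frozen = set(json.loads(parser["training"]["frozen_components"]))
    return [name for name in pipeline if name not in frozen]


print(trainable_components(APPROACH_1))  # []
print(trainable_components(APPROACH_2))  # ['ner']
```

With these settings, the first config leaves no component to train, while the second leaves only ner.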
Personally, I think adding a blank NER component makes more sense, because without it there is no ner folder in the model directory, which leads to:
FileNotFoundError: [Errno 2] No such file or directory: 'en_core_sci_scibert_without_ner/ner/moves'
I apologize for the verbose wording, and I sincerely appreciate any suggestions and help you can offer!