After many experiments, I found the problem!
I was using ner.make-gold the wrong way.
I thought I needed to reject the model's incorrect NER predictions even after I had already corrected them.
After viewing the contents of the Prodigy SQLite database, it turns out Prodigy doesn't record the model's original prediction, only the corrected final result.
So my dataset basically shows that I rejected ~80% of what were (after my corrections) correct answers, and accepted only the ~20% of predictions the model got right on its own. The model struggles because it's training on contradictory data.
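For anyone who wants to check this on their own data, here's a minimal sketch of how I inspected the stored answers through Prodigy's Python database API (the dataset name ner_data is just a placeholder for your own):

```python
from collections import Counter

from prodigy.components.db import connect

# Connect to the default Prodigy database (an SQLite file under ~/.prodigy)
db = connect()

# "ner_data" is a placeholder -- substitute your own dataset name
examples = db.get_dataset("ner_data")

# Each stored example keeps the final (corrected) spans plus the answer
# that was clicked; the model's original prediction is not stored anywhere.
print(Counter(eg["answer"] for eg in examples))
# A large majority of "reject" answers was the giveaway in my case.
```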
I deleted the entire SQLite database, added a text preprocessing step, and started over with ner.manual. It now reaches 90% accuracy after 100 training iterations.
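My actual preprocessing is specific to my texts, but as a rough sketch of the idea (the file names and the whitespace cleanup here are placeholders, not my exact pipeline):

```python
import json
import re
from pathlib import Path

# Hypothetical file names -- adjust to your own corpus
RAW_FILE = Path("raw_texts.txt")
OUT_FILE = Path("cleaned.jsonl")

def clean(text: str) -> str:
    """Collapse runs of whitespace and strip the line."""
    return re.sub(r"\s+", " ", text).strip()

with RAW_FILE.open(encoding="utf8") as fin, OUT_FILE.open("w", encoding="utf8") as fout:
    for line in fin:
        cleaned = clean(line)
        if cleaned:
            # Prodigy loads JSONL where each line has a "text" field
            fout.write(json.dumps({"text": cleaned}) + "\n")
```

After that I annotated into a fresh dataset with something like `prodigy ner.manual my_new_dataset en_core_web_sm cleaned.jsonl --label PERSON,ORG` (the dataset name, model, and labels here are just examples).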
It seems the reject button serves no real purpose in ner.make-gold or ner.manual; maybe you could consider disabling it there to avoid confusion.