Train spaCy's 'en_core_web_md' model manually using ner.manual

I tried training spaCy's 'en_core_web_md' model to identify diseases using ner.teach, with text from the Twitter, Times Now and Guardian APIs.
But the model has a very low accuracy of 40% and marks literally anything as a 'DISEASE' entity.
Tokens like "to", "from" and "She", as well as commas and colons, are also predicted as 'DISEASE' entities.

I'm not sure why this happens, but I suspect the problem is that the dataset contains very few sentences that mention diseases and many more that mention none.
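A quick way to check that suspicion is to count how many annotated examples actually contain a DISEASE span. The snippet below is only a rough sketch: it assumes the annotations are exported as JSONL records with "text", optional "spans" and "answer" keys, and the function name and sample records are made up for illustration.

```python
import json


def disease_example_counts(jsonl_lines, label="DISEASE"):
    """Return (examples_with_label, total_examples) for JSONL annotation records."""
    with_label = 0
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        answer = record.get("answer", "accept")
        # Ignored examples carry no information, so leave them out of the count.
        if answer == "ignore":
            continue
        total += 1
        spans = record.get("spans", [])
        # Only accepted examples with a matching span count as positive.
        if answer == "accept" and any(s.get("label") == label for s in spans):
            with_label += 1
    return with_label, total


# Hypothetical sample records, not real data.
sample = [
    '{"text": "She was diagnosed with malaria.", '
    '"spans": [{"start": 23, "end": 30, "label": "DISEASE"}], "answer": "accept"}',
    '{"text": "The train leaves at noon.", "spans": [], "answer": "accept"}',
    '{"text": "Traffic was heavy today.", "answer": "accept"}',
]

with_label, total = disease_example_counts(sample)
print(f"{with_label} of {total} examples contain a {'DISEASE'} span")
```

If the first number is tiny compared to the second, the imbalance is probably worth addressing before collecting more data.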

Now I am planning to create my own dataset, and I need some guidance here.

  1. How many sentences containing diseases will I need for training?
  2. I want to train the model manually using ner.manual. How do I do it?


Hi! You might find the NER annotation flowchart helpful, which should answer your main questions and give you some inspiration for what to try:

It might make sense to start off with a blank model instead of the pre-trained NER component of the en_core_web_md model. If there are entity types you want to keep (like PERSON), you can use the ner.make-gold recipe with the existing labels plus DISEASE. This will pre-highlight the existing predictions and let you correct them and manually add the annotations for your new DISEASE label. When you train your model later on, make sure to set the --no-missing flag to tell spaCy that the annotated spans are complete and that unannotated tokens are not part of any entity.
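To make the effect of --no-missing a bit more concrete, here is a small plain-Python sketch of the idea. It is not spaCy's or Prodigy's actual code; the biluo_tags helper and the example sentence are made up for illustration, and the "-" marker for an unknown tag follows spaCy's BILUO convention as far as I know. Without the flag, unannotated tokens are treated as unknown; with it, they are treated as definitely outside any entity.

```python
def biluo_tags(tokens, entities, no_missing):
    """Build per-token BILUO tags.

    tokens: list of token strings.
    entities: list of (start_token, end_token_exclusive, label) tuples.
    no_missing: if True, unannotated tokens become "O" (not an entity);
                if False, they become "-" (annotation unknown).
    """
    fill = "O" if no_missing else "-"
    tags = [fill] * len(tokens)
    for start, end, label in entities:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags


# Hypothetical example sentence with one manually annotated span.
tokens = ["She", "has", "type", "2", "diabetes", "."]
entities = [(2, 5, "DISEASE")]

print(biluo_tags(tokens, entities, no_missing=True))
# ['O', 'O', 'B-DISEASE', 'I-DISEASE', 'L-DISEASE', 'O']
print(biluo_tags(tokens, entities, no_missing=False))
# ['-', '-', 'B-DISEASE', 'I-DISEASE', 'L-DISEASE', '-']
```

With fully manual annotations, the first form is the one you want: the model gets an explicit signal that tokens like "She" or "." are not diseases, which should help with the over-prediction you're seeing.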