I tried training spaCy's 'en_core_web_md' model to identify diseases using ner.teach, with text pulled from the Twitter, Times Now, and Guardian APIs.
But the model has a very low accuracy of 40% and marks almost anything as a 'DISEASE' entity.
Even tokens like "to", "from", "She", commas, and colons are predicted as 'DISEASE' entities.
I'm not sure why this happens, but I suspect the dataset is the problem: very few of its sentences mention diseases, and most mention none at all.
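To check that suspicion, I could count how many annotated examples actually contain a 'DISEASE' span. This is only a rough sketch: it assumes the annotations are exported as Prodigy-style records (dicts with a "text" key and an optional "spans" list whose items carry a "label"), and that each span is an accepted annotation. The function name `disease_balance` and the sample records are made up for illustration.

```python
def disease_balance(examples, label="DISEASE"):
    """Count examples with and without at least one span of `label`.

    Assumes each example is a dict with an optional "spans" list,
    where every span is a dict containing a "label" key.
    """
    with_label = 0
    for eg in examples:
        spans = eg.get("spans") or []  # a missing or None "spans" means no entities
        if any(span.get("label") == label for span in spans):
            with_label += 1
    total = len(examples)
    return {
        "with": with_label,
        "without": total - with_label,
        "ratio": with_label / total if total else 0.0,
    }


# Hypothetical sample records, not real data from my dataset.
sample = [
    {"text": "She was diagnosed with malaria.",
     "spans": [{"start": 23, "end": 30, "label": "DISEASE"}]},
    {"text": "The match starts at noon.", "spans": []},
    {"text": "Traffic was heavy today."},
    {"text": "He flew from Delhi to London.",
     "spans": [{"start": 13, "end": 18, "label": "GPE"}]},
]

stats = disease_balance(sample)
print(stats)
```

If the ratio turns out to be very small, that would at least be consistent with the imbalance I described, although it would not prove that imbalance is the cause.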
Now I am planning to create my own dataset, and I need some guidance:
- How many sentences containing diseases will I need for training?
- I want to annotate the training data manually using ner.manual. How do I do that?