I have trained spaCy's `en_core_web_md` model to identify diseases, naming this new entity label "MEDICAL". (I used 500 sentences that mention diseases.)
The steps I performed to train it are listed below:
```
prodigy dataset medical_entity "Seed terms for MEDICAL label"
prodigy terms.teach medical_entity en_core_web_md --seeds "cough, allergy, asthma, constipation, dehydration"
prodigy terms.to-patterns medical_entity med_patterns.jsonl --label MEDICAL
prodigy ner.manual medical_entity en_core_web_md med_test2.jsonl --label MEDICAL
```

(`med_test2.jsonl` contains 500 sentences with medical entities)

```
prodigy ner.make-gold medical_entity en_core_web_md "suffering" --api twitter --label MEDICAL,PERSON,ORG,LOC,DATE,NORP -U
prodigy ner.batch-train medical_entity en_core_web_md --output with-medical-model --label MEDICAL,PERSON,ORG,LOC,DATE,NORP --eval-split 0.2 --n-iter 6 --batch-size 8 --no-missing
```
The batch-train command reports 57% accuracy, and the model does not mark diseases accurately when I run `prodigy ner.print-stream with-medical-model "suffering" --api twitter`.
Is too little data the reason for this, or have I made a mistake somewhere in the steps above?
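For scale, here is the rough arithmetic for the batch-train run above, assuming all 500 annotated sentences end up in the dataset (the exact split Prodigy computes may differ slightly):

```python
# Rough numbers for ner.batch-train with the flags used above
# (500 sentences is an assumption taken from the post)
total = 500
eval_split = 0.2
batch_size = 8
n_iter = 6

n_eval = int(total * eval_split)              # 100 held out for evaluation
n_train = total - n_eval                      # 400 used for training
batches_per_iter = -(-n_train // batch_size)  # ceil(400 / 8) = 50 batches
total_updates = batches_per_iter * n_iter     # 300 weight updates in total

print(n_train, n_eval, batches_per_iter, total_updates)  # → 400 100 50 300
```

So the model only ever sees 400 training examples and makes 300 weight updates, which is very little for learning a brand-new entity type on top of the existing labels.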
I also want to understand how spaCy predicts entities. If I train it on diseases like cancer, cold, fever, etc., how will it predict untrained diseases? (Based on the context information in the sentence?)
For example, if my model is trained on thousands of sentences containing thousands of diseases, will it predict new, unseen diseases? If yes, how?
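My current understanding (please correct me if wrong) is that the model does not memorize a disease list: it scores each token using features like its prefix, suffix, word shape, word vector, and the surrounding tokens, so an unseen disease can still be tagged if it appears in a familiar context. A deliberately over-simplified toy sketch of that idea, nothing like spaCy's actual statistical model:

```python
# Toy illustration of context-based tagging (NOT spaCy's real model):
# label a token MEDICAL purely because its left context matches a
# pattern seen during training, regardless of the token itself.
TRAINED_CONTEXTS = {("suffering", "from"), ("diagnosed", "with")}

def predict_medical(tokens):
    labels = []
    for i, tok in enumerate(tokens):
        left = tuple(tokens[max(0, i - 2):i])  # two words to the left
        labels.append("MEDICAL" if left in TRAINED_CONTEXTS else "O")
    return labels

# "malaria" was never in the training vocabulary, but the context matches:
print(predict_medical("He is suffering from malaria".split()))
# → ['O', 'O', 'O', 'O', 'MEDICAL']
```

Is this roughly how the real model generalizes, just with learned weights over many such features instead of a hand-written lookup?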