So, I have trained spaCy's 'en_core_web_md' model to identify diseases and named this new entity "MEDICAL". (I used 500 sentences that mention diseases.)
The steps I performed to train spaCy are listed below:
prodigy dataset medical_entity "Seed terms for MEDICAL label"
The batch-train command reports 57% accuracy, and the model does not mark diseases accurately when I run prodigy ner.print-stream with-medical-model "suffering" --api twitter.
Is too little data the reason for this, or have I made some mistakes in the steps above?
I also want to understand how spaCy predicts entities. If I train spaCy on diseases like cancer, cold, fever etc., how will it predict diseases it hasn't been trained on? (Based on the context information in the sentence?)
For example, if my model is trained on thousands of sentences containing thousands of diseases, will it predict new, untrained diseases? If yes, how?
@ameyn21 If you want the full in-depth explanation of the current NER model implementation, this video by @honnibal should be helpful:
The short and simplified version is: based on the very local context of the surrounding words, the model predicts whether a sequence of tokens is an entity. The prediction is made using the weights you've trained on your data, so the data should be representative of what the model will see at runtime. This allows it to generalise and predict entities that it hasn't seen during training.
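To see this in action, you could load your trained model and run it over a sentence containing a disease that never appeared in your training data. This is just a quick sketch, and the model path is a placeholder for wherever you saved the trained artifact:

```python
import spacy

# Placeholder path: point this at the output directory of your trained model
nlp = spacy.load("/path/to/with-medical-model")

# "neuroblastoma" doesn't have to appear in the training data: if the
# surrounding context ("diagnosed with ...") resembles the contexts your
# MEDICAL examples occurred in, the model can still predict it
doc = nlp("My aunt was diagnosed with neuroblastoma last year.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Whether it actually gets this right depends on how representative your training data was, of course.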
Your workflow looks fine and you seem to be doing everything right: you're including the entity types that the model previously got right and you're training with --no-missing. However, I think I might know what the problem is:
For both annotation runs, you're saving the data to the same dataset (medical_entity) and then training with --no-missing on the set containing all annotations. So your dataset contains two types of annotations: the ones with all labels, created with ner.make-gold, and the ones where you only labelled MEDICAL. Since you're training with --no-missing, all tokens that are not annotated will be treated as "outside an entity". That's correct for the ner.make-gold portion, where every entity was labelled. But for the portion where you only annotated MEDICAL, it's problematic, because that data might very well contain persons or organisations.
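To make this concrete, here's roughly what two records in the same dataset could end up looking like. The sentence and the character offsets are made up purely for illustration:

```python
# From the ner.make-gold run: all entity types are annotated, so any
# unmarked token really is "outside an entity"
gold_example = {
    "text": "John Smith was treated for malaria in London.",
    "spans": [
        {"start": 0, "end": 10, "label": "PERSON"},
        {"start": 27, "end": 34, "label": "MEDICAL"},
        {"start": 38, "end": 44, "label": "GPE"},
    ],
}

# From the MEDICAL-only run: "John Smith" and "London" are simply left
# unannotated, not confirmed to be non-entities
medical_only_example = {
    "text": "John Smith was treated for malaria in London.",
    "spans": [
        {"start": 27, "end": 34, "label": "MEDICAL"},
    ],
}
```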
So basically, during training, your model sees a bunch of examples where things like person names are labelled PERSON, and a bunch of examples where very similar spans are supposedly not entities at all. It then tries to make sense of that and train weights that represent both, which is pretty difficult and will likely fail. On top of that, you're evaluating on a portion of that shuffled data where sometimes a span is labelled (because the example comes from the ner.make-gold run) and sometimes it isn't (because it comes from the MEDICAL-only run). This likely explains the bad accuracy you're seeing.
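At the token level, the difference between "missing" and "outside" looks something like this. The tags are written out by hand in the BILUO scheme, just to illustrate what the --no-missing flag implies for the MEDICAL-only example above:

```python
tokens = ["John", "Smith", "was", "treated", "for", "malaria", "in", "London", "."]

# Default interpretation: unannotated tokens are unknown ("-"), so the
# model isn't penalised for predicting PERSON on "John Smith"
missing_gold = ["-", "-", "-", "-", "-", "U-MEDICAL", "-", "-", "-"]

# With --no-missing: unannotated tokens become hard "O" labels, which
# directly contradicts the make-gold examples labelling "John Smith"
# as PERSON and "London" as GPE
no_missing_gold = ["O", "O", "O", "O", "O", "U-MEDICAL", "O", "O", "O"]

for token, tag in zip(tokens, no_missing_gold):
    print(f"{token}\t{tag}")
```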