Improve accuracy of the Spacy model

So, I have trained spaCy's 'en_core_web_md' model to identify diseases, and named this new entity "MEDICAL". (I used 500 sentences that mention diseases.)
The steps I performed to train Spacy are mentioned below,

  1. prodigy dataset medical_entity "Seed terms for MEDICAL label"

  2. prodigy terms.teach medical_entity en_core_web_md --seeds "cough, allergy, asthma, constipation, dehydration"

  3. prodigy terms.to-patterns medical_entity med_patterns.jsonl --label MEDICAL

  4. less med_patterns.jsonl

  5. prodigy ner.manual medical_entity en_core_web_md med_test2.jsonl --label MEDICAL
    (med_test2.jsonl contains 500 sentences with medical entities)

  6. prodigy ner.make-gold medical_entity en_core_web_md "suffering" --api twitter --label MEDICAL,PERSON,ORG,LOC,DATE,NORP -U

  7. prodigy ner.batch-train medical_entity en_core_web_md --output with-medical-model --label MEDICAL,PERSON,ORG,LOC,DATE,NORP --eval-split 0.2 --n-iter 6 --batch-size 8 --no-missing
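For reference, the `med_patterns.jsonl` file produced in step 3 contains one JSON object per line, each mapping the label to a token pattern. A minimal sketch of what that file might look like (the specific terms here are illustrative, not your actual seed terms):

```python
import json

# Illustrative contents of med_patterns.jsonl (one JSON object per line):
# each line pairs the MEDICAL label with a token pattern for the matcher.
patterns_jsonl = """\
{"label": "MEDICAL", "pattern": [{"lower": "cough"}]}
{"label": "MEDICAL", "pattern": [{"lower": "asthma"}]}
{"label": "MEDICAL", "pattern": [{"lower": "dehydration"}]}
"""

patterns = [json.loads(line) for line in patterns_jsonl.splitlines()]
for p in patterns:
    # Every pattern carries the label and a list of token specifications.
    print(p["label"], p["pattern"])
```

Running `less med_patterns.jsonl` (step 4) should show lines of roughly this shape.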

The batch-train command reports 57% accuracy, and the model does not mark diseases accurately when I run prodigy ner.print-stream with-medical-model "suffering" --api twitter.

Is too little data the reason behind this, or have I made some mistakes in the above steps?

I also want to understand how spaCy predicts entities. If I train spaCy with diseases like cancer, cold, fever etc., how will it predict untrained diseases? (Based on the context information in the sentence?)
For example, if my model is trained on thousands of sentences containing thousands of diseases, will the model predict new, untrained diseases? If yes, how?

Hi @ameyn21,

I can't comment on all of your spaCy steps, but I think you don't have enough data for what you're trying to do.

If you're training a new entity type, I recommend checking out the flowchart that @ines put together. You can find it here: Annotation Flowchart: Named Entity Recognition

The first item on the flowchart says if you don't have > 1000 items, gather more data 🙂

Thanks @justindujardin. I think that's the only problem.
But I also want to know how spaCy's NER works internally. How does it predict entities?

@ameyn21 If you want the full in-depth explanation of the current NER model implementation, this video by @honnibal should be helpful:

The short and simplified version is: based on the very local context of the surrounding words, the model predicts whether a sequence of tokens is an entity. The prediction is made using the weights you've trained on your data, so the data should be representative of what the model will see at runtime. This allows it to generalise and predict entities that it hasn't seen during training.
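As a very rough illustration of that intuition (this is a hypothetical sketch, not spaCy's actual implementation, which uses learned embeddings and a neural network): imagine the model scoring each token mostly from a small window of surrounding words. A disease name never seen in training can still be tagged MEDICAL if its context looks like the contexts seen in training:

```python
# Toy sketch of context-window features, NOT spaCy's real model.
# The point: the surrounding words drive the prediction, so an
# unseen entity in a familiar context can still be recognised.
def context_features(tokens, i, window=2):
    """Collect the lowercase words around position i."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens) and j != i:
            feats.append((offset, tokens[j].lower()))
    return feats

# "diagnosed with X" looks the same whether X was seen in training or not.
seen = "She was diagnosed with asthma last year".split()
unseen = "She was diagnosed with chikungunya last year".split()

print(context_features(seen, 4))    # context around "asthma"
print(context_features(unseen, 4))  # identical context around "chikungunya"
```

Both calls produce the same feature set, which is why a model trained on "asthma" in this context has a chance of labelling "chikungunya" too.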

Your workflow looks fine and you seem to be doing everything right: you're including the entity types that the model previously got right and you're training with --no-missing. However, I think I might know what the problem is:

For both annotation runs, you're saving the data to the same dataset (medical_entity) and then training with --no-missing on the set containing all annotations. So your dataset contains two types of annotations: the ones with all labels created with ner.make-gold and the ones where you only labelled MEDICAL. Since you're training with --no-missing, all tokens that are not annotated will be treated as "outside an entity", which makes sense for the second portion of the data. But for the first portion of the data where you only annotated MEDICAL, this is problematic, because that data might very well contain persons or organisations.

So basically, during training, your model sees a bunch of examples where things like person names are labelled PERSON, and a bunch of examples where very similar spans are supposed to be not an entity. It then tries to make sense of that and train weights to represent it, which is pretty difficult and will likely fail. And then you're also evaluating on a portion of that shuffled data where sometimes a span is labelled (because it's from the second annotation run) and sometimes it's not (because it's from the first). This likely explains the bad accuracy you're seeing.
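To make the conflict concrete, here's a hypothetical pair of tasks as they might sit side by side in the dataset (the texts and span offsets are illustrative). With --no-missing, the unannotated "John" in the second example is treated as "not an entity", directly contradicting the first:

```python
# Two Prodigy-style annotation tasks (simplified, illustrative offsets).

# Run 1 (ner.make-gold): all labels were annotated.
gold_task = {
    "text": "John has asthma",
    "spans": [
        {"start": 0, "end": 4, "label": "PERSON"},
        {"start": 9, "end": 15, "label": "MEDICAL"},
    ],
}

# Run 2 (MEDICAL only): the same kind of name is left unannotated.
medical_only_task = {
    "text": "John has a cough",
    "spans": [
        {"start": 11, "end": 16, "label": "MEDICAL"},
    ],
}

def labels_with_no_missing(task):
    """Under --no-missing, any span not listed counts as 'not an entity'."""
    return {(s["start"], s["end"]): s["label"] for s in task["spans"]}

# "John" (0, 4) is PERSON in one example...
print(labels_with_no_missing(gold_task).get((0, 4)))          # PERSON
# ...and implicitly "no entity" in the other, which confuses training.
print(labels_with_no_missing(medical_only_task).get((0, 4)))  # None
```

Keeping the two annotation runs in separate datasets, or only training with --no-missing on the fully annotated one, avoids this contradiction.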

Thanks @ines !

That explains my problem. I will collect more data and train the model more accurately.