NER Trained Model Analysis

Hello,
I was able to train a spancat model and achieved an accuracy of 58%. I used the Prodigy train CLI command for training. Now that training is done, I want to do further analysis on the result. Since it's an NER-style task, I'd like to find the accuracy per entity of my model and, if possible, visualize it later. What is the best way to go about this? I think there is an option for entity-wise label stats in the prodigy train command, but I don't want to run the train command and retrain the model since that is very time-consuming.

hi @shabbirrafiq!

So yes, prodigy train has a --label-stats flag that will show a breakdown of per-label stats after training is completed.

But since you don't want to run it again, could you look at the meta.json file in your model path (e.g., model-best) and view the performance? This is standard for spaCy models.

For example (just ran on dummy data):

{
  "lang":"en",
  "name":"pipeline",
  "version":"0.0.0",
  "spacy_version":">=3.5.2,<3.6.0",
  "description":"",
  "author":"",
  "email":"",
  "url":"",
  "license":"",
  "spacy_git_version":"aea4a96f9",
  "vectors":{
    "width":0,
    "vectors":0,
    "keys":0,
    "name":null,
    "mode":"default"
  },
  "labels":{
    "tok2vec":[

    ],
    "ner":[
      "LOCATION",
      "ORG",
      "PERSON"
    ]
  },
  "pipeline":[
    "tok2vec",
    "ner"
  ],
  "components":[
    "tok2vec",
    "ner"
  ],
  "disabled":[

  ],
  "performance":{
    "ents_f":0.0,
    "ents_p":0.0,
    "ents_r":0.0,
    "ents_per_type":{
      "ORG":{
        "p":0.0,
        "r":0.0,
        "f":0.0
      },
      "PERSON":{
        "p":0.0,
        "r":0.0,
        "f":0.0
      }
    },
    "tok2vec_loss":10.4360037228,
    "ner_loss":5.3757577943
  }
}
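
If you'd rather pull those numbers out of meta.json programmatically than read the raw JSON, here's a minimal sketch (the model path is a placeholder for wherever your model-best lives):

import srsly

# meta.json sits next to the trained pipeline (e.g., model-best/meta.json)
meta = srsly.read_json("model-best/meta.json")

# per-entity precision/recall/F1 live under performance -> ents_per_type
for label, scores in meta["performance"].get("ents_per_type", {}).items():
    print(f"{label}: P={scores['p']:.2f} R={scores['r']:.2f} F={scores['f']:.2f}")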

Hope this helps!

Hello Ryan,
Thanks for the quick response. Attached below is the meta.json file output. It does not include entity-wise accuracy. Can you suggest any other way? Is there any way I can run the prodigy train command again starting from this trained model? I think that would be faster. If there is any other way, let me know too.

Ah, I think spancat may not provide label-level stats in meta.json the way ner does.

Just curious - did you train with prodigy train or spacy train by first running data-to-spacy?

If you did the latter (which is one of the many benefits of using data-to-spacy + spacy train), you can use spacy evaluate like this:

spacy evaluate my_model/model-best dev.spacy

where my_model/model-best is the location of your model and dev.spacy is the location of your evaluation data in spaCy binary format (which is one of the outputs of data-to-spacy).

For example:

$ python -m spacy evaluate span_model/model-best dev.spacy
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

================================== Results ==================================

TOK      100.00
SPAN P   98.73 
SPAN R   93.98 
SPAN F   96.30 
SPEED    7111  


============================== SPANS (per type) ==============================

               P        R        F
state     100.00    87.50    93.33
address    95.24    95.24    95.24
city      100.00    96.00    97.96
zip       100.00   100.00   100.00

If you used prodigy train, the problem is that spacy evaluate needs both the model and your dedicated evaluation dataset as inputs, and prodigy train does not save out a dedicated evaluation dataset (it only outputs the model).
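
For reference, the data-to-spacy step looks roughly like this (output directory, dataset name, and eval split are placeholders; spancat annotations go through the --spancat flag):

python -m prodigy data-to-spacy ./corpus --spancat my_spancat_dataset --eval-split 0.2

That should write train.spacy, dev.spacy, and a config.cfg into ./corpus, and that dev.spacy is what you'd later pass to spacy evaluate.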


I used Prodigy train. I think it's better if I train the model again. Thank you for the help.

Hello @ryanwesslen. Thank you for your help. I tried your spaCy approach to get the label-wise accuracy. Attached are the results.


It is a spancat problem and I have 4 entities in my dataset. The problem is that for the entity Application Domain, I get 0 precision, recall, and F1 score. Why is that happening and how can I fix it? Also, I want the accuracy of each label too; how can I get that?

Thank you

hi @shabbirrafiq,

Can you run spaCy's debug data command on your data?

This will run through several diagnostics on your annotation data and can help surface problems with it. This is where using data-to-spacy (and thus spacy train, not prodigy train) as a first step would make things very easy.
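
If you have the data-to-spacy output on disk, the call is roughly this (paths are placeholders; you can drop the overrides if your config already points at the corpus files):

python -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy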

In addition - have you tried running your training examples back through your model with spans.correct? This would be the best way to discover where your model isn't working correctly.
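
For reference, a spans.correct call would look roughly like this (dataset name, model path, source file, and labels are all placeholders):

python -m prodigy spans.correct spancat_corrections ./span_model/model-best ./training_data.jsonl --label LABEL_ONE,LABEL_TWO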

Or even better - if you score your training data and save it to a new dataset, you can then use the review recipe to show both your annotations and the model's predictions side by side. Then you can make an assessment of what's going wrong with your model.

This is using spaCy's scorer, which only reports precision, recall, and F1, since F1 is a better measure of accuracy than raw accuracy when labels are imbalanced.
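
As a quick refresher on how those three numbers relate (toy counts, just for illustration):

# toy counts for one label: true positives, false positives, false negatives
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 0.80 -> of the spans predicted, how many were right
recall = tp / (tp + fn)     # 0.67 -> of the gold spans, how many were found
f1 = 2 * precision * recall / (precision + recall)  # ~0.73, the harmonic mean

print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")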

See this post for more details:

Hello Ryan. What do you mean by "score your training data"? Can you please elaborate on this step and how I should do it?
I was able to run the debug data command. Attached are the results.


It means it is able to identify the Application Domain entity. Is it not working because the entity has fewer examples in the training set?

hi @shabbirrafiq,

Sorry for the confusion. By "scoring" your training data, I simply mean making predictions on it. For example, you give the model a sentence from your training data and save its predicted spans into a Prodigy dataset.

Essentially doing the same as what spans.correct does, but just putting it into a dataset without human review.

Are you using Prodigy v1.12? (If not, most of the recipe will still work, but you'll need to follow our pre-v1.12 way of handling streams.)

For v1.12+, I sketched out a script you can use:

import copy

from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
from prodigy.components.db import connect

import spacy
import srsly

def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            if labels and ent.label_ not in labels:
                continue
            spans.append(
                {
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "text": ent.text,
                    "label": ent.label_
                }
            )
        task["spans"] = spans
        task["answer"] = "accept" # set answer as accept for all
        task["_view_id"] = "ner_manual" # set as using ner_manual view
        yield task

# load your annotation data and the pipeline you want to score it with
examples = srsly.read_jsonl("news_headlines.jsonl")
nlp = spacy.load("en_core_web_sm")
labels = ["ORG","PERSON","LOCATION"]

# wrap the examples in a Prodigy stream, deduplicating on the text
stream = get_stream(
    examples, rehash=True, dedup=True, input_key="text"
)
# add token info (needed for the manual UIs), then the model's predictions
stream.apply(add_tokens, nlp=nlp, stream=stream)
stream.apply(make_tasks, nlp=nlp, stream=stream, labels=labels)

# save the scored examples into a new Prodigy dataset
db = connect()
db.add_dataset("scored_ner_dataset")
db.add_examples(stream, ["scored_ner_dataset"])

I don't have a spancat model ready to use, so I used a ner component to illustrate this example.

A few things you'll need to change:

  • task["_view_id"] = "ner_manual" to task["_view_id"] = "spans_manual"
  • "news_headlines.jsonl" to your training data.
  • nlp = spacy.load("en_core_web_sm") to load your own pipeline instead (see the note after this list)
  • labels = ["ORG","PERSON","LOCATION"] to the labels you will be using
  • "scored_ner_dataset" to a unique name for your new dataset (e.g., maybe scored_spancat_dataset)

This will then put the "scored" examples (i.e., examples now with model predictions) into the dataset scored_ner_dataset.

Now, with the additional labeled data you have -- let's call it ner_dataset (although yours will be for spancat) -- you can run the review recipe to compare both sets of annotations: your model's predictions (scored_ner_dataset) and your actual annotations (ner_dataset).

You can do this by running:

python -m prodigy review reviewed scored_ner_dataset,ner_dataset

The top pane is a normal annotation interface that you can use to reconcile the examples; that is, your annotation there can serve as the "gold standard" example.

In this example, maybe after reviewing I find that my model didn't find either of the two spans I wanted, while my past annotation (ner_dataset-ryan) only captured one of the entities I wanted (e.g., maybe I made an error on this example). So I manually select my entities (spans) and save this into the reviewed dataset. I could then use this reconciled data.

Just FYI - this is only a suggestion. You could likely move much faster by simply using spans.correct to see the model's predictions. The downside is that you can't also see your actual training annotations in the same interface.

I don't know for sure. That's why you'll need to review some examples, either with spans.correct or with the workflow I showed above.

The good news is that your spans have pretty good characteristics, and span characteristics are generally where the issues lie when training a spancat model.

I was actually loading the wrong model in the spacy evaluate command. I have fixed it and everything is working fine now. Thank you very much for the help @ryanwesslen
