Detailed evaluation of NER model trained from Prodigy annotations

lievcin · December 10, 2021, 7:08pm

Hi everyone,
I have been using Prodigy for a couple of weeks now and finding it extremely useful and intuitive. The ability to train spaCy models directly is also a really nice feature.
I am, however, a little stuck on how to annotate and then validate the model's predictions. My workflow is:

Create a few terms, use sense2vec to enrich that list
Make a few annotations
Train model
Iterate with more annotations and further training.

However, I'm finding it quite hard to evaluate the model beyond the top-line training metrics. If this were a scikit-learn model, I would load the ground-truth labels, score the data with the model, and then explore the records where the two disagree. That is usually a good way to find out whether the model is actually picking up additional positive examples I didn't label.

I can't seem to find a straightforward way of doing a similar analysis in either Prodigy or spaCy. So my eyeball eval flow looks like:

db-export dataset with gold annotations
load the trained spaCy model
iterate over the exported annotations, run each "text" through the model to get a spaCy Doc object, and extract the predicted entities
extract the gold entities from the annotations
load both sets of entities (gold and predicted) into a dataframe
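The gold-annotation side of the steps above could be sketched roughly as follows. This is only a minimal sketch: it assumes the db-export output is JSONL where each record has a "text", an optional "spans" list with "start", "end" and "label", and an "answer" field. The function name gold_spans_from_export and the choice to keep only accepted records are illustrative assumptions, not part of Prodigy.

```python
import json
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def gold_spans_from_export(lines: Iterable[str]) -> List[Tuple[str, Set[Span]]]:
    """Parse db-export style JSONL lines into (text, gold spans) pairs."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: only accepted answers count as gold annotations.
        if record.get("answer", "accept") != "accept":
            continue
        spans = {(s["start"], s["end"], s["label"]) for s in record.get("spans", [])}
        results.append((record["text"], spans))
    return results
```

The function accepts any iterable of lines, so an open file handle for the exported file can be passed in directly.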

Then I can inspect the results and see where the model is predicting additional valid examples, which examples it is finding hard to match, etc.

However, this seems like a rather involved way of carrying out the process. Maybe there's a better way?
Any help is really appreciated!

Thank you!

ines · December 13, 2021, 11:27am

Hi! If your goal is to put together a dataframe and explore the predictions, a simpler solution would be to run data-to-spacy to export your annotations in spaCy's format. You can use this for training, and also for easy access to the annotations.

Under the hood, the .spacy files are just collections of Doc objects. So if you load them back in, you get a Doc object with doc.ents, just like you'd get from running a model over your text (see: DocBin · spaCy API Documentation). You can then compare those entities to the predictions by one or more models on the same text, and store the results or differences in a dataframe.
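A rough sketch of that comparison might look like the following. It assumes entities are compared by exact match on (start_char, end_char, label); the helper names compare_spans and rows_from_docbin, and the "match" / "missed" / "extra" status values, are made up for illustration. spaCy is imported inside the loader so the pure comparison helper can be used on its own.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def compare_spans(text: str, gold: Set[Span], pred: Set[Span]) -> List[Dict]:
    """Return one row per span found in the gold data, the predictions, or both."""
    rows = []
    for start, end, label in sorted(gold | pred):
        span = (start, end, label)
        if span in gold and span in pred:
            status = "match"
        elif span in gold:
            status = "missed"  # annotated, but not predicted
        else:
            status = "extra"  # predicted, but not annotated
        rows.append({"text": text, "span_text": text[start:end],
                     "label": label, "status": status})
    return rows


def rows_from_docbin(model_path: str, docbin_path: str) -> List[Dict]:
    """Compare a trained pipeline's doc.ents with the gold doc.ents in a .spacy file."""
    import spacy  # imported here so compare_spans works without spaCy installed
    from spacy.tokens import DocBin

    nlp = spacy.load(model_path)
    rows = []
    for gold_doc in DocBin().from_disk(docbin_path).get_docs(nlp.vocab):
        pred_doc = nlp(gold_doc.text)
        gold = {(e.start_char, e.end_char, e.label_) for e in gold_doc.ents}
        pred = {(e.start_char, e.end_char, e.label_) for e in pred_doc.ents}
        rows.extend(compare_spans(gold_doc.text, gold, pred))
    return rows
```

The returned list of dicts can be passed straight to pandas.DataFrame, and filtering on the status column then surfaces the mismatches.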

If you prefer a more visual approach, you could also build a little annotation workflow that loads in your existing annotations and adds entries to the "spans" for the model's predictions, e.g. using different labels like model:ORG and data:ORG etc., maybe even with different custom colours for the different label types. If you use the spans_manual UI to render it, you can view multiple overlapping spans and view how the predictions compare to the original data. Even if you just skip through the results and don't actually annotate anything, it could be a nice way to visualize the results.

lievcin · December 13, 2021, 12:16pm

Hi Ines,
Thank you for your response, this is indeed very helpful. I was looking into Example and Scorer from spaCy, and now that you mention that the data-to-spacy output is a collection of Docs, maybe I could use those to build Examples.
Re the visual approach, this is a neat suggestion. Is there an example in the docs of how a custom workflow can be built? Do you mean a custom recipe?
I appreciate the help, and thanks again for your helpful response.
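For reference, building Examples from the exported Docs and scoring the entities could be sketched along these lines. This assumes spaCy v3, a trained pipeline directory and a .spacy file produced by data-to-spacy; the function name score_ner and both path parameters are placeholders.

```python
from typing import Any, Dict


def score_ner(model_path: str, docbin_path: str) -> Dict[str, Any]:
    """Score a trained pipeline's entities against gold Docs from a .spacy file."""
    # spaCy is imported inside the function so that defining it does not require spaCy.
    import spacy
    from spacy.scorer import Scorer
    from spacy.tokens import DocBin
    from spacy.training import Example

    nlp = spacy.load(model_path)
    gold_docs = DocBin().from_disk(docbin_path).get_docs(nlp.vocab)
    # Each Example pairs the model's prediction with the gold reference Doc.
    examples = [Example(nlp(gold.text), gold) for gold in gold_docs]
    # Returns precision, recall and F-score for doc.ents, overall and per entity type.
    return Scorer.score_spans(examples, "ents")
```

This only gives aggregate numbers per label, so it complements rather than replaces a record-by-record look at the mismatches.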

ines · December 13, 2021, 12:31pm

Yes, exactly! You can read more about custom recipes here: Custom Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP. I think this recipe should be a really good starting point, and it already does most of what you want:

github.com

explosion/prodigy-recipes/blob/master/ner/ner_make_gold.py

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string, set_hashes
import spacy
import copy
from typing import List, Optional


def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            # Continue if predicted entity is not selected in labels
            if labels and ent.label_ not in labels:

This file has been truncated.

The only difference in your case would be that you want to use "view_id": "spans_manual" to support overlapping spans. And instead of resetting the spans = [], you'd add the predicted spans on top of the spans that are already annotated in the input data.

To distinguish the predicted spans, you could use a label like f"MODEL:{ent.label_}" (or maybe just M:, since that's shorter). If you want it to look fancier, you can also add some custom label colours for the different labels in your data, e.g. one version for MODEL:{label} and one for the regular label (see: Web Application · Prodigy · An annotation tool for AI, Machine Learning & NLP). So your annotated labels could be blue, and the predicted labels could be red, or something like that.
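A minimal sketch of the span-merging step described above, not the actual recipe code: the helper name add_model_spans is hypothetical, ents is expected to be doc.ents (or any objects exposing start, end, start_char, end_char and label_), and token_end is assumed to be inclusive.

```python
import copy
from typing import Any, Dict, Iterable


def add_model_spans(task: Dict[str, Any], ents: Iterable[Any],
                    prefix: str = "MODEL:") -> Dict[str, Any]:
    """Return a copy of the task with predicted entities added to its existing spans."""
    new_task = copy.deepcopy(task)
    # Keep the annotated spans instead of resetting them to an empty list.
    spans = list(new_task.get("spans") or [])
    for ent in ents:
        spans.append({
            "start": ent.start_char,
            "end": ent.end_char,
            "token_start": ent.start,
            # Assumption: token_end is inclusive, while spaCy's ent.end is exclusive.
            "token_end": ent.end - 1,
            "label": f"{prefix}{ent.label_}",
        })
    new_task["spans"] = spans
    return new_task
```

Inside a make_tasks-style loop this would be called as add_model_spans(eg, doc.ents) for each (doc, eg) pair coming out of nlp.pipe.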

lievcin · December 14, 2021, 5:10pm

Hi Ines,
Thank you for the help! I now have a recipe that shows me both the predicted labels and my annotations.

The one outstanding point is the dataset parameter for the recipe. Since the recipe doesn't need to save new annotations, I was wondering if there's a way to start it without a dataset. But when I remove all references to it from the recipe, I get this exception, which seems to be coming from the library itself:

✘ Invalid components returned by recipe 'ner.model-evaluation'

dataset   field required

{'view_id': 'spans_manual', 'stream': <generator object make_tasks at 0x13dbc4ac0>, 'config': {'lang': 'en', 'labels': ['DATA_ENTRY']}}

Seems like a pain to have to keep a dummy dataset around ¯_(ツ)_/¯

ines · December 14, 2021, 5:13pm

Awesome, thanks for updating

If you don't want to save anything to a dataset, you can just set "dataset": False explicitly – sorry, this might be slightly under-documented at the moment because it's a fairly rare use case.
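As a sketch, the components returned by the recipe would then look something like this. The keys mirror the ones in the error message above; build_components is a hypothetical helper, and the @prodigy.recipe decorator is left out so the snippet runs without Prodigy installed.

```python
from typing import Any, Dict, Iterable, List


def build_components(stream: Iterable[Dict[str, Any]], labels: List[str],
                     lang: str = "en") -> Dict[str, Any]:
    """Build the dict a view-only recipe would return."""
    return {
        "dataset": False,  # explicitly tell Prodigy not to save to a dataset
        "view_id": "spans_manual",  # supports overlapping spans
        "stream": stream,
        "config": {"lang": lang, "labels": labels},
    }
```

In the real recipe, the decorated recipe function would return this dict directly.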

lievcin · December 14, 2021, 5:23pm

That works nicely!

Re documentation: agreed, it's not a common case. I'm actually wondering whether this inspection will make me want to add more annotations for the cases the model is identifying that I failed to label.
The dataset will be back.

Thanks for the help and awesome libraries!
