Review examples where the model fails to predict correctly

Has anybody come up with a convenient recipe command to view all the examples where the NER model predicts wrong? I'd like to see whether there is some systematically wrong behavior. I could imagine a lot of people have wanted to do that as well at some point.

I know I can just write a small recipe that prints to the console in some convenient way but maybe somebody already made that (or similar)?

@explosion: do you recommend using Prodigy for this kind of task (no feedback needed, just evaluation)? Or would you recommend some other tool like Streamlit or similar, in case printing to the terminal does not suffice?

Just to make sure I understand the question correctly, you mean on the evaluation data, right? This should be pretty straightforward to implement because you'd just need to run your trained model over the examples and then compare the predicted doc.ents to the annotated spans (even as basic as comparing the (start, end, label) of the entity spans). Or, if you wanted it to be fancier, you could also check if it's a false positive/negative or if just the label is wrong.
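
The comparison described above can be sketched in plain Python. This is only a rough sketch: the function names `span_key` and `compare_spans` and the result keys are made up for illustration, and it assumes both sides are lists of dicts with "start", "end" and "label" keys, like the spans in Prodigy's annotation format.

```python
def span_key(span):
    """Reduce a span dict to a hashable (start, end, label) tuple."""
    return (span["start"], span["end"], span["label"])


def compare_spans(gold_spans, pred_spans):
    """Sort annotated (gold) and predicted spans into error categories."""
    gold = {span_key(s) for s in gold_spans}
    pred = {span_key(s) for s in pred_spans}

    # Predictions with the same character offsets as a gold span but a
    # different label: reported as (start, end, gold_label, predicted_label).
    gold_by_offsets = {(start, end): label for start, end, label in gold}
    wrong_label = sorted(
        (start, end, gold_by_offsets[(start, end)], label)
        for start, end, label in pred - gold
        if (start, end) in gold_by_offsets
    )
    relabeled = {(start, end) for start, end, _, _ in wrong_label}

    return {
        "correct": sorted(gold & pred),
        "wrong_label": wrong_label,
        # Predicted, but no gold span with these offsets.
        "false_positives": sorted(k for k in pred - gold if k[:2] not in relabeled),
        # Annotated, but no prediction with these offsets.
        "false_negatives": sorted(k for k in gold - pred if k[:2] not in relabeled),
    }


gold = [
    {"start": 0, "end": 4, "label": "PERIOD"},
    {"start": 10, "end": 14, "label": "PERIOD"},
    {"start": 20, "end": 25, "label": "ORG"},
]
pred = [
    {"start": 0, "end": 4, "label": "PERIOD"},
    {"start": 10, "end": 14, "label": "DATE"},
    {"start": 30, "end": 35, "label": "PERIOD"},
]
result = compare_spans(gold, pred)
```

With a trained pipeline, the predicted side could be built from `doc.ents` using `ent.start_char`, `ent.end_char` and `ent.label_`. An example is worth reviewing whenever anything other than "correct" is non-empty.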

Using Prodigy could work here, especially if you want to click through the examples – you could even use blocks and a text input and then leave some notes for yourself on the particularly interesting examples. Or add multiple choice options to categorize the types of mistakes (kinda similar to the evaluation recipe I built for the image captioning tutorial). If you have the start/end/label data, rendering the examples in Prodigy will be super easy and it's probably one of the quickest ways to get something onto the screen.
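
As a rough sketch of the blocks idea: each task gets an "options" list for the choice block, and the blocks config combines the NER view, the choices and a free-text field. The category IDs and labels below and the helper name `add_review_options` are hypothetical, and the `"text": None` override on the choice block is meant to keep the text from being rendered a second time, so treat it as an assumption to check against the Prodigy docs for your version.

```python
# Hypothetical error categories; adjust them to the mistakes you actually see.
ERROR_TYPES = [
    ("false_positive", "False positive"),
    ("false_negative", "False negative"),
    ("wrong_label", "Wrong label"),
    ("wrong_boundary", "Wrong boundary"),
]


def add_review_options(task):
    """Return a copy of the task with multiple-choice options attached."""
    task = dict(task)
    task["options"] = [{"id": opt_id, "text": text} for opt_id, text in ERROR_TYPES]
    return task


# Blocks for the recipe's "config": spans first, then categories, then notes.
blocks = [
    {"view_id": "ner"},
    {"view_id": "choice", "text": None},
    {"view_id": "text_input", "field_rows": 3, "field_label": "Notes"},
]

task = add_review_options(
    {"text": "from 2010 to 2012", "spans": [{"start": 5, "end": 9, "label": "PERIOD"}]}
)
```

Setting `"choice_style": "multiple"` in the recipe's config should allow selecting more than one category per example.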

If you want to use Streamlit, this NER data visualizer I built for one of the tutorials might be a helpful starting point:

That's a really good idea, thanks. Is it possible to show two different sets of spans (one for each block)? My current recipe looks like this:

import prodigy
import spacy
from prodigy.components.db import connect
from wasabi import msg


# The recipe name is a placeholder; the decorator is not shown in the original post.
@prodigy.recipe("review-ner")
def review_ner(eval_dataset: str, spacy_model: str, review_dataset: str):
    db = connect()

    if eval_dataset not in db:
        msg.fail(f"Can't find dataset '{eval_dataset}' in database", exits=1)

    nlp = spacy.load(spacy_model)
    examples = db.get_dataset(eval_dataset)

    def get_stream():
        for eg in examples:
            doc = nlp(eg["text"])
            # Annotated spans from the evaluation dataset
            eval_spans = [
                {
                    "start": span["start"],
                    "end": span["end"],
                    "label": span["label"],
                    "text": eg["text"][span["start"] : span["end"]],
                }
                for span in eg["spans"]
            ]
            # Spans predicted by the model, restricted to the PERIOD label
            pred_spans = [
                {
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "text": ent.text,
                }
                for ent in doc.ents
                if ent.label_ == "PERIOD"
            ]
            yield_example = {
                "text": eg["text"],
                "eval_spans": eval_spans,
                "spans": pred_spans,
            }
            # Only send out examples where prediction and annotation differ
            wrong_pred = False
            if len(pred_spans) != len(eval_spans):
                wrong_pred = True
            for eval_span, pred_span in zip(eval_spans, pred_spans):
                if eval_span != pred_span:
                    wrong_pred = True

            if wrong_pred:
                yield yield_example

    blocks = [
        # {"view_id": "ner", "spans": "eval_spans"},
        {"view_id": "ner"},
        {
            "view_id": "text_input",
            "field_rows": 3,
            "field_label": "Any comments?",
        },
    ]

    return {
        "dataset": review_dataset,
        "view_id": "blocks",
        "stream": get_stream(),
        "config": {
            "blocks": blocks,
        },
    }

It would be cool if I could show the correct spans and the predicted spans. Is something like this possible?

{"view_id": "ner", "spans": "eval_spans"}
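
One possible workaround, not confirmed anywhere in this thread, would be to keep the predicted spans in the ner block and render the annotated spans yourself as HTML for a separate html block. A minimal sketch, assuming the html block displays whatever is stored under the task's "html" key; the helper name `spans_to_html` and the markup are made up for illustration:

```python
from html import escape


def spans_to_html(text, spans):
    """Wrap each span in <mark> and append its label, escaping everything else."""
    parts = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            # Skip spans that overlap one already rendered.
            continue
        parts.append(escape(text[cursor : span["start"]]))
        covered = escape(text[span["start"] : span["end"]])
        label = escape(span["label"])
        parts.append(f"<mark>{covered} <strong>{label}</strong></mark>")
        cursor = span["end"]
    parts.append(escape(text[cursor:]))
    return "".join(parts)


html = spans_to_html(
    "from 2010 to 2012 & on",
    [{"start": 5, "end": 9, "label": "PERIOD"}],
)
```

In the recipe above this could become `yield_example["html"] = spans_to_html(eg["text"], eval_spans)`, with `{"view_id": "html"}` added to the blocks list next to the ner block.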