ner.correct annotations with custom NER model

Hi, I retrained the named entity recogniser to add a new entity (ACTIVITY) and added some custom components to the pipeline to improve the NER results downstream. I packaged the model (including the factories for the custom components) and installed the package in another venv. Everything works fine when I use spaCy directly or visualize the entities with displaCy. However, when I use the model in Prodigy (ner.correct), I get shifted annotations.
Prodigy shows the same sequence of annotations, but each one covers only a single word, so I suspect that the merging of the tokens somehow gets lost when the model runs in Prodigy. But since it works fine with spacy.load(), I have no idea how to fix that.

Not sure this is the answer you're looking for, but I noticed that Prodigy's rendering is sensitive to the order of entities. So please make sure your tasks have their spans sorted in ascending token order. For good measure, you should probably make sure the tokens are sorted too, e.g. like this:
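
# illustrative only: pre-sort a task dict's spans and tokens by character offset,
# assuming eg uses the standard Prodigy "spans"/"tokens" keys
eg["spans"] = sorted(eg.get("spans", []), key=lambda s: (s["start"], s["end"]))
eg["tokens"] = sorted(eg.get("tokens", []), key=lambda t: t["start"])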

The thing is that I am using the Prodigy ner.correct recipe with the custom model (including the custom pipeline components) on unannotated text. If I use the model on the same blank text to annotate the entities and visualize the outcome with displaCy, it performs really well. But in the ner.correct recipe it doesn't. If you compare the two screenshots, you'll see that the sequence of annotations is the same, but the extent is not: Prodigy always annotates only one word, while the displaCy rendering shows that most of the annotations span multiple words.
One of the custom components that I added to the pipeline merges tokens that belong to the same entity, so ent.end is always ent.start + 1. I suspect that Prodigy (I use v1.10.2) either uses its own tokenizer and therefore gets the annotations that come from the model wrong, or that I did something wrong when I created the pipeline that only breaks when I use it with Prodigy.
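
For context, the merging component does roughly what spaCy's built-in merge_entities does. A simplified sketch (not my exact code, the name is made up):

def merge_entity_tokens(doc):
    # merge each entity span into a single token, so ent.end == ent.start + 1
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc

nlp.add_pipe(merge_entity_tokens, after="ner")  # spaCy v2-style registration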

Ah, that would explain what's going on, yes! Prodigy calls nlp.make_doc to generate a tokenized Doc object (which typically delegates to nlp.tokenizer) and doesn't run the full pipeline just to tokenize, which would usually be unnecessarily expensive.
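
To see the difference (the model name here is just a placeholder):

import spacy

nlp = spacy.load("your_custom_model")      # placeholder name
doc = nlp.make_doc("I went hiking today")  # tokenizer only, no custom components
doc = nlp("I went hiking today")           # full pipeline, including your merging component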

The best solution would probably be to just write your own adaptation of the ner.correct recipe, e.g. based on this template:

But instead of calling add_tokens, you'd add a "tokens" property to the task, using the processed Doc object with the named entities that's also used to add the entity suggestions. For example:

def get_token(token):
    return {
        "text": token.text,
        "start": token.idx,                  # character offset of token start
        "end": token.idx + len(token.text),  # character offset of token end
        "id": token.i,                       # token index, referenced by the spans' token_start/token_end
        "ws": bool(token.whitespace_),       # whether the token is followed by whitespace
    }

# in the make_tasks helper
task["tokens"] = [get_token(token) for token in doc]

That was the solution! Many thanks Ines!


Cool, glad it worked! :+1:

Btw, here's how I'm thinking about solving this going forward with spaCy v3: pipeline components in v3 are registered using the Language.component or Language.factory decorators, which let you attach additional meta information to a component, like the annotations it sets and requires (for pipeline analysis) and whether it retokenizes. So you could then mark your custom component that merges entities with retokenizes=True, and Prodigy could check this and default to running the whole pipeline to produce a tokenized doc. If no components retokenize, it would use the more efficient make_doc.
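
A hypothetical sketch of what that could look like (the component and model names are made up; nlp.analyze_pipes is spaCy v3's pipeline analysis API):

import spacy
from spacy.language import Language

# declare that the component changes the tokenization
@Language.component("merge_entity_tokens", retokenizes=True)
def merge_entity_tokens(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc

nlp = spacy.load("your_custom_model")  # placeholder name
nlp.add_pipe("merge_entity_tokens", after="ner")

# Prodigy could then only run the full pipeline when some component
# retokenizes, and otherwise stick to the cheaper nlp.make_doc
analysis = nlp.analyze_pipes()
needs_full_pipeline = any(info["retokenizes"] for info in analysis["summary"].values())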

That would of course be a very elegant solution. Looking forward to these changes.
Thanks for the update!