Fixing annotation mistakes (workflow question)

Hey everyone,

I noticed a mistake in our annotation data and was wondering what the easiest way to correct it would be. Say I wanted to change the label of the first span to something like ECON_ORG

output of prodigy print-dataset ner_economist_speeches2 | rg "US Treasury"

The former US Treasury ECONOMIST chief and top economist Larry Summers ECONOMIST recently said that Brexit will be remembered as a “historic economic error”, adding that he would be “very surprised” if the UK avoided a recession in the next two years. He also noted that the UK’s economic situation was “frankly more acute than in most other major countries”.

exporting the db to jsonl, changing it there, and importing it again seems error-prone, especially because the span text is sometimes missing in the jsonl for some reason (maybe related to this: Change some annotations for existing dataset)

I think I will just delete the entire row in the jsonl for now and reannotate the example but maybe there is a more elegant solution.

Hi @nikita!

The solution depends on whether it's just this particular string that needs to be relabelled, or whether the entire ECONOMIST label is "polluted".

If it's just "US Treasury" (or a small set of known surface forms):
A targeted patch via the database API is the cleanest thing. The DB stores examples as plain dicts so you can mutate the copy directly:

import copy
from prodigy.components.db import connect

db = connect()
src = "ner_economist_speeches2"
dst = "ner_economist_speeches2_fixed"

corrected = []
for eg in db.get_dataset_examples(src):
    eg = copy.deepcopy(eg)  # don't mutate the example the DB handed back
    for span in eg.get("spans", []):
        # Reconstruct the surface form from the character offsets
        surface = eg["text"][span["start"]:span["end"]]
        if surface == "US Treasury" and span["label"] == "ECONOMIST":
            span["label"] = "ECON_ORG"
    corrected.append(eg)

# Write to a new dataset so the original stays intact
db.add_dataset(dst)
db.add_examples(corrected, datasets=[dst])

A few things going on here:

  • We reconstruct the surface form with eg["text"][span["start"]:span["end"]] instead of reading span["text"], because the text field on a span may not be reliable in your dataset. The character offsets start/end (along with the label and the token offsets) are what the trainer actually uses; they're the source of truth. Any text field on a span is informational: built-in recipes do attach it on fresh annotations, but it's optional in the schema, so older datasets, custom recipes, or examples that round-tripped through ner_manual on Prodigy < 1.10.4 can legitimately be missing it. Slicing the text by offsets always works regardless.
  • We match on label too (span["label"] == "ECONOMIST") so if "US Treasury" ever shows up correctly tagged as e.g. ORG elsewhere, we don't touch that one.
  • Once you've verified the corrected dataset with prodigy print-dataset ner_economist_speeches2_fixed, you can point training at the new name directly.
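To make the offset-slicing point concrete, here's a toy self-contained example (the text and offsets are invented for illustration; the span dict just mirrors the shape of a Prodigy NER span):

```python
# A minimal example in Prodigy's task shape: the span carries only
# character offsets and a label, no "text" field at all.
eg = {
    "text": "The former US Treasury chief Larry Summers spoke.",
    "spans": [{"start": 11, "end": 22, "label": "ECONOMIST"}],
}

span = eg["spans"][0]
# Slicing the example text by the span's offsets recovers the surface form,
# whether or not the span has its own "text" field.
surface = eg["text"][span["start"]:span["end"]]
print(surface)  # US Treasury
```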

If the whole ECONOMIST label is polluted (you don't know upfront which spans are wrong)
Then re-annotation is the right call, but you don't want to redo the entire dataset, only the examples that contain ECONOMIST, while keeping all the other labels untouched. To filter examples by span label, you can use a simple Python script:

"""
Split a Prodigy NER dataset by a label that needs re-annotation.

After running:
  - <src>_polluted : examples containing at least one bad-label span
                     (all spans kept, for context)
  - <src>_cleaned  : every example, with the bad label stripped

Usage:
  python split_for_relabel.py ner_economist_speeches2 ECONOMIST
"""

import sys
from prodigy.components.db import connect


def split(src: str, bad_label: str) -> None:
    db = connect()
    if src not in db.datasets:
        raise SystemExit(f"Dataset {src!r} not found")

    polluted_name = f"{src}_polluted"
    cleaned_name = f"{src}_cleaned"
    for name in (polluted_name, cleaned_name):
        if name in db.datasets:
            raise SystemExit(
                f"{name!r} already exists — drop it first with `prodigy drop {name}`"
            )

    examples = db.get_dataset_examples(src)
    polluted, cleaned = [], []
    for eg in examples:
        spans = eg.get("spans") or []
        if any(s["label"] == bad_label for s in spans):
            polluted.append(dict(eg))
        cleaned_eg = dict(eg)
        cleaned_eg["spans"] = [s for s in spans if s["label"] != bad_label]
        cleaned.append(cleaned_eg)

    db.add_dataset(polluted_name)
    db.add_examples(polluted, datasets=[polluted_name])
    db.add_dataset(cleaned_name)
    db.add_examples(cleaned, datasets=[cleaned_name])

    print(f"{src}: {len(examples)} examples")
    print(f"  → {polluted_name}: {len(polluted)} (contain {bad_label})")
    print(f"  → {cleaned_name}: {len(cleaned)} ({bad_label} stripped)")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise SystemExit("usage: python split_for_relabel.py <dataset> <bad_label>")
    split(sys.argv[1], sys.argv[2])

Then:

  1. split
    python split_for_relabel.py ner_economist_speeches2 ECONOMIST
    This will create two datasets: _polluted & _cleaned. _polluted contains only the examples that have at least one span with the polluted label (all spans kept, for context), while _cleaned contains every example with the polluted label's spans stripped out.
  2. relabel
    prodigy ner.manual ner_econ_relabeled dataset:ner_economist_speeches2_polluted --label PERSON,ORG,ECON_ORG,ECONOMIST...
    If you're sure the only two labels that need to be modified are ECON_ORG and ECONOMIST, then specify just those with --label
  3. train on cleaned-rest + relabeled-polluted
    prodigy train ./out --ner ner_economist_speeches2_cleaned,ner_econ_relabeled

The train recipe takes care of the merge (so does data-to-spacy). The merge is conflict-free because the spans are grouped by input hash and deduplicated by (start, end, label). For polluted examples, _cleaned contributes everything except ECONOMIST and ner_econ_relabeled contributes everything plus ECON_ORG (with no ECONOMIST left), so the union is exactly what you want. For non-polluted examples, only _cleaned contributes and they're unchanged.
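Here's a minimal sketch of why that union is conflict-free. This is illustrative pure Python, not Prodigy's actual merge code; the hash value and offsets are invented:

```python
from collections import defaultdict

def merge_spans(examples):
    """Group examples by input hash and union spans keyed on (start, end, label)."""
    merged = defaultdict(set)
    for eg in examples:
        for s in eg["spans"]:
            merged[eg["_input_hash"]].add((s["start"], s["end"], s["label"]))
    return merged

# Two versions of the same input: one from _cleaned, one from the relabeled set.
cleaned = {"_input_hash": 1, "spans": [
    {"start": 0, "end": 5, "label": "PERSON"},
]}
relabeled = {"_input_hash": 1, "spans": [
    {"start": 0, "end": 5, "label": "PERSON"},      # identical span: deduped away
    {"start": 11, "end": 22, "label": "ECON_ORG"},  # the corrected span
]}

print(sorted(merge_spans([cleaned, relabeled])[1]))
# [(0, 5, 'PERSON'), (11, 22, 'ECON_ORG')]
```

The overlap on the PERSON span collapses to a single entry, and the corrected ECON_ORG span joins the union with nothing to conflict against, since _cleaned no longer carries any ECONOMIST span there.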

Heads up: don't reach for prodigy db-merge here — it's plain concatenation and would put two copies of the same input into the merged dataset with conflicting spans. Combining at training time via train or data-to-spacy is the right path.

Finally, if you'd rather not do any extra scripting, you can re-annotate the polluted dataset and use review to reconcile the conflicts manually.

  1. re-annotate the source texts with ECON_ORG / ECONOMIST only
    prodigy ner.manual ner_econ_org_reannotated ner_economist_speeches2 --label ECON_ORG,ECONOMIST
  2. review: surface every example where the two datasets disagree on a span
    prodigy review ner_economist_speeches_rev ner_economist_speeches2,ner_econ_org_reannotated --view-id ner_manual
  3. train on the reviewed dataset
    prodigy train ./out --ner ner_economist_speeches_rev

prodigy review groups examples by input hash and shows the conflicting span versions side by side so the annotator can resolve them. The output is a single clean dataset you can train on directly.
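What "disagree" means here, in a tiny self-contained sketch (the hash value and offsets are invented; this just illustrates the comparison, not Prodigy's internals):

```python
def span_set(eg):
    """Reduce an example's spans to a comparable set of (start, end, label)."""
    return {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}

# Same _input_hash means same underlying text; the span sets differ,
# so review would show both versions side by side for the annotator.
original = {"_input_hash": 7, "spans": [{"start": 11, "end": 22, "label": "ECONOMIST"}]}
reannotated = {"_input_hash": 7, "spans": [{"start": 11, "end": 22, "label": "ECON_ORG"}]}

print(span_set(original) != span_set(reannotated))  # True
```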

The previous suggestions are more time-efficient, but they require an extra scripting step to split and filter the dataset.