Fixing annotation mistakes (workflow question)

Hey everyone,

I noticed a mistake in our annotation data and was wondering what the easiest way to correct it would be. Say I wanted to change the label of the first span to something like ECON_ORG

output of prodigy print-dataset ner_economist_speeches2 | rg "US Treasury"

The former US Treasury ECONOMIST chief and top economist Larry Summers ECONOMIST recently said that Brexit will be remembered as a “historic economic error”, adding that he would be “very surprised” if the UK avoided a recession in the next two years. He also noted that the UK’s economic situation was “frankly more acute than in most other major countries”.

exporting the db to jsonl, changing it there, and importing it again seems error-prone, especially because the span text is sometimes missing in the jsonl for some reason (maybe related to this: Change some annotations for existing dataset)

I think I will just delete the entire row in the jsonl for now and reannotate the example but maybe there is a more elegant solution.

Hi @nikita!

The solution depends on whether it's just this particular string that needs to be relabelled, or whether the entire ECONOMIST label is "polluted".

If it's just "US Treasury" (or a small set of known surface forms):
A targeted patch via the database API is the cleanest thing. The DB stores examples as plain dicts so you can mutate the copy directly:

import copy
from prodigy.components.db import connect

db = connect()
src = "ner_economist_speeches2"
dst = "ner_economist_speeches2_fixed"

corrected = []
for eg in db.get_dataset_examples(src):
    eg = copy.deepcopy(eg)  # don't mutate the example the DB handed back
    for span in eg.get("spans", []):
        # Reconstruct the surface form from the character offsets
        surface = eg["text"][span["start"]:span["end"]]
        if surface == "US Treasury" and span["label"] == "ECONOMIST":
            span["label"] = "ECON_ORG"
    corrected.append(eg)

# Write to a new dataset so the original stays intact
db.add_dataset(dst)
db.add_examples(corrected, datasets=[dst])

A few things going on here:

  • We reconstruct the surface form with eg["text"][span["start"]:span["end"]] instead of reading span["text"], because the text field on a span may not be reliable in your dataset. The character offsets start/end (along with the label and the token offsets) are what the trainer actually uses; they're the source of truth. Any text field on a span is informational: built-in recipes do attach it on fresh annotations, but it's optional in the schema, so older datasets, custom recipes, or examples that round-tripped through ner_manual on Prodigy < 1.10.4 can legitimately be missing it. Slicing the text by offsets always works regardless.
  • We match on label too (span["label"] == "ECONOMIST") so if "US Treasury" ever shows up correctly tagged as e.g. ORG elsewhere, we don't touch that one.
  • Once you've verified the corrected dataset with prodigy print-dataset ner_economist_speeches2_fixed, you can point training at the new name directly.
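To make the offset-slicing point concrete, here's a toy self-contained example (the text and offsets are invented for illustration; the span dict just mirrors the shape of a Prodigy NER span):

```python
# A minimal example in Prodigy's task shape: the span carries only
# character offsets and a label, no "text" field at all.
eg = {
    "text": "The former US Treasury chief Larry Summers spoke.",
    "spans": [{"start": 11, "end": 22, "label": "ECONOMIST"}],
}

span = eg["spans"][0]
# Slicing the example text by the span's offsets recovers the surface form,
# whether or not the span has its own "text" field.
surface = eg["text"][span["start"]:span["end"]]
print(surface)  # US Treasury
```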

If the whole ECONOMIST label is polluted (you don't know upfront which spans are wrong)
Then re-annotation is the right call, but you don't want to redo the entire dataset, only the examples that contain ECONOMIST, while keeping all the other labels untouched. To filter examples by span label, you can use a simple Python script:

"""
Split a Prodigy NER dataset by a label that needs re-annotation.

After running:
  - <src>_polluted : examples containing at least one bad-label span
                     (all spans kept, for context)
  - <src>_cleaned  : every example, with the bad label stripped

Usage:
  python split_for_relabel.py ner_economist_speeches2 ECONOMIST
"""

import sys
from prodigy.components.db import connect


def split(src: str, bad_label: str) -> None:
    db = connect()
    if src not in db.datasets:
        raise SystemExit(f"Dataset {src!r} not found")

    polluted_name = f"{src}_polluted"
    cleaned_name = f"{src}_cleaned"
    for name in (polluted_name, cleaned_name):
        if name in db.datasets:
            raise SystemExit(
                f"{name!r} already exists — drop it first with `prodigy drop {name}`"
            )

    examples = db.get_dataset_examples(src)
    polluted, cleaned = [], []
    for eg in examples:
        spans = eg.get("spans") or []
        if any(s["label"] == bad_label for s in spans):
            polluted.append(dict(eg))
        cleaned_eg = dict(eg)
        cleaned_eg["spans"] = [s for s in spans if s["label"] != bad_label]
        cleaned.append(cleaned_eg)

    db.add_dataset(polluted_name)
    db.add_examples(polluted, datasets=[polluted_name])
    db.add_dataset(cleaned_name)
    db.add_examples(cleaned, datasets=[cleaned_name])

    print(f"{src}: {len(examples)} examples")
    print(f"  → {polluted_name}: {len(polluted)} (contain {bad_label})")
    print(f"  → {cleaned_name}: {len(cleaned)} ({bad_label} stripped)")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise SystemExit("usage: python split_for_relabel.py <dataset> <bad_label>")
    split(sys.argv[1], sys.argv[2])

Then:

  1. split
    python split_for_relabel.py ner_economist_speeches2 ECONOMIST
    This will create two datasets: _polluted & _cleaned. _polluted contains only the examples that have at least one span with the polluted label (all spans kept, for context), while _cleaned contains every example with the polluted label's spans stripped out.
  2. relabel
    prodigy ner.manual ner_econ_relabeled dataset:ner_economist_speeches2_polluted --label PERSON,ORG,ECON_ORG,ECONOMIST...
    If you're sure the only two labels that need to be modified are ECON_ORG and ECONOMIST, then specify just those with --label
  3. train on cleaned-rest + relabeled-polluted
    prodigy train ./out --ner ner_economist_speeches2_cleaned,ner_econ_relabeled

The train recipe takes care of the merge (so does data-to-spacy). The merge is conflict-free because the spans are grouped by input hash and deduplicated by (start, end, label). For polluted examples, _cleaned contributes everything except ECONOMIST and ner_econ_relabeled contributes everything plus ECON_ORG (with no ECONOMIST left), so the union is exactly what you want. For non-polluted examples, only _cleaned contributes and they're unchanged.
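Here's a minimal sketch of why that union is conflict-free. This is illustrative pure Python, not Prodigy's actual merge code; the hash value and offsets are invented:

```python
from collections import defaultdict

def merge_spans(examples):
    """Group examples by input hash and union spans keyed on (start, end, label)."""
    merged = defaultdict(set)
    for eg in examples:
        for s in eg["spans"]:
            merged[eg["_input_hash"]].add((s["start"], s["end"], s["label"]))
    return merged

# Two versions of the same input: one from _cleaned, one from the relabeled set.
cleaned = {"_input_hash": 1, "spans": [
    {"start": 0, "end": 5, "label": "PERSON"},
]}
relabeled = {"_input_hash": 1, "spans": [
    {"start": 0, "end": 5, "label": "PERSON"},      # identical span: deduped away
    {"start": 11, "end": 22, "label": "ECON_ORG"},  # the corrected span
]}

print(sorted(merge_spans([cleaned, relabeled])[1]))
# [(0, 5, 'PERSON'), (11, 22, 'ECON_ORG')]
```

The overlap on the PERSON span collapses to a single entry, and the corrected ECON_ORG span joins the union with nothing to conflict against, since _cleaned no longer carries any ECONOMIST span there.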

Heads up: don't reach for prodigy db-merge here — it's plain concatenation and would put two copies of the same input into the merged dataset with conflicting spans. Combining at training time via train or data-to-spacy is the right path.

Finally, if you'd rather not do any extra scripting, you can re-annotate the polluted dataset and use review to reconcile the conflicts manually.

  1. re-annotate the source texts with ECON_ORG / ECONOMIST only
    prodigy ner.manual ner_econ_org_reannotated ner_economist_speeches2 --label ECON_ORG,ECONOMIST
  2. review: surface every example where the two datasets disagree on a span
    prodigy review ner_economist_speeches_rev ner_economist_speeches2,ner_econ_org_reannotated --view-id ner_manual
  3. train on the reviewed dataset
    prodigy train ./out --ner ner_economist_speeches_rev

prodigy review groups examples by input hash and shows the conflicting span versions side by side so the annotator can resolve them. The output is a single clean dataset you can train on directly.
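What "disagree" means here, in a tiny self-contained sketch (the hash value and offsets are invented; this just illustrates the comparison, not Prodigy's internals):

```python
def span_set(eg):
    """Reduce an example's spans to a comparable set of (start, end, label)."""
    return {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}

# Same _input_hash means same underlying text; the span sets differ,
# so review would show both versions side by side for the annotator.
original = {"_input_hash": 7, "spans": [{"start": 11, "end": 22, "label": "ECONOMIST"}]}
reannotated = {"_input_hash": 7, "spans": [{"start": 11, "end": 22, "label": "ECON_ORG"}]}

print(span_set(original) != span_set(reannotated))  # True
```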

The previous suggestions are more time-efficient, but they require an extra scripting step to split and filter the dataset.