Hi @nikita!
The solution depends on whether it's just this particular string that needs to be relabelled, or whether the entire ECONOMIST label is "polluted".
If it's just "US Treasury" (or a small set of known surface forms):
A targeted patch via the database API is the cleanest approach. The DB stores examples as plain dicts, so you can mutate a copy directly:
```python
import copy

from prodigy.components.db import connect

db = connect()
src = "ner_economist_speeches2"
dst = "ner_economist_speeches2_fixed"

corrected = []
for eg in db.get_dataset_examples(src):
    eg = copy.deepcopy(eg)  # don't mutate the original example
    for span in eg.get("spans", []):
        surface = eg["text"][span["start"]:span["end"]]
        if surface == "US Treasury" and span["label"] == "ECONOMIST":
            span["label"] = "ECON_ORG"
    corrected.append(eg)

db.add_dataset(dst)
db.add_examples(corrected, datasets=[dst])
```
A few things going on here:
- We reconstruct the surface form with `eg["text"][span["start"]:span["end"]]` instead of reading `span["text"]`, because the `text` field on a span may not be reliable in your dataset. The character offsets `start`/`end` (along with the label and the token offsets) are what the trainer actually uses; they're the source of truth. Any `text` field on a span is informational. Built-in recipes do attach it on fresh annotations, but it's optional in the schema, so older datasets, custom recipes, or examples that round-tripped through `ner_manual` on Prodigy < 1.10.4 can legitimately be missing it. Slicing the text by offsets always works regardless.
- We match on the label too (`span["label"] == "ECONOMIST"`), so if "US Treasury" ever shows up correctly tagged as e.g. ORG elsewhere, we don't touch that span.
- Once you've verified the corrected dataset with `prodigy print-dataset ner_economist_speeches2_fixed`, you can either point training at the new name directly or drop the original dataset and keep working with the fixed one.
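If you want to sanity-check the patch logic before running it against the real database, you can exercise it on a hand-built example dict first (the text and offsets below are made up for illustration):

```python
# Hypothetical mini-example mimicking one stored annotation.
eg = {
    "text": "The US Treasury published its outlook.",
    "spans": [{"start": 4, "end": 15, "label": "ECONOMIST"}],
}

for span in eg.get("spans", []):
    surface = eg["text"][span["start"]:span["end"]]
    if surface == "US Treasury" and span["label"] == "ECONOMIST":
        span["label"] = "ECON_ORG"

print(eg["spans"][0]["label"])  # → ECON_ORG
```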
If the whole ECONOMIST label is polluted (you don't know upfront which spans are wrong):
Then re-annotation is the right call, but you don't want to redo the entire dataset, only the examples that contain ECONOMIST, while keeping all the other labels untouched. To filter examples by span label, you can use a simple Python script:
"""
Split a Prodigy NER dataset by a label that needs re-annotation.
After running:
- <src>__polluted : examples containing at least one bad-label span
(all spans kept, for context)
- <src>__cleaned : every example, with the bad label stripped
Usage:
python split_for_relabel.py ner_economist_speeches2 ECONOMIST
"""
import sys
from prodigy.components.db import connect
def split(src: str, bad_label: str) -> None:
db = connect()
if src not in db.datasets:
raise SystemExit(f"Dataset {src!r} not found")
polluted_name = f"{src}_polluted"
cleaned_name = f"{src}_cleaned"
for name in (polluted_name, cleaned_name):
if name in db.datasets:
raise SystemExit(
f"{name!r} already exists — drop it first with `prodigy drop {name}`"
)
examples = db.get_dataset_examples(src)
polluted, cleaned = [], []
for eg in examples:
spans = eg.get("spans") or []
if any(s["label"] == bad_label for s in spans):
polluted.append(dict(eg))
cleaned_eg = dict(eg)
cleaned_eg["spans"] = [s for s in spans if s["label"] != bad_label]
cleaned.append(cleaned_eg)
db.add_dataset(polluted_name)
db.add_examples(polluted, datasets=[polluted_name])
db.add_dataset(cleaned_name)
db.add_examples(cleaned, datasets=[cleaned_name])
print(f"{src}: {len(examples)} examples")
print(f" → {polluted_name}: {len(polluted)} (contain {bad_label})")
print(f" → {cleaned_name}: {len(cleaned)} ({bad_label} stripped)")
if __name__ == "__main__":
if len(sys.argv) != 3:
raise SystemExit("usage: python split_for_relabel.py <dataset> <bad_label>")
split(sys.argv[1], sys.argv[2])
Then:
- split

  ```
  python split_for_relabel.py ner_economist_speeches2 ECONOMIST
  ```

  This creates two datasets: `_polluted` contains only the examples with at least one polluted span (all their spans kept, for context), while `_cleaned` contains every example with the polluted label stripped out.
- relabel

  ```
  prodigy ner.manual ner_econ_relabeled blank:en dataset:ner_economist_speeches2_polluted --label PERSON,ORG,ECON_ORG,ECONOMIST...
  ```
  If you're sure the only two labels that need to be modified are ECON_ORG and ECONOMIST, then specify just those with `--label`.
- train on cleaned-rest + relabeled-polluted

  ```
  prodigy train ./out --ner ner_economist_speeches2_cleaned,ner_econ_relabeled
  ```
The train recipe takes care of the merge (so does data-to-spacy). The merge is conflict-free because the spans are grouped by input hash and deduped by (start, end, label). For polluted examples, `_cleaned` contributes everything except ECONOMIST and `ner_econ_relabeled` contributes everything plus ECON_ORG (with no ECONOMIST left), so the union is exactly what you want. For non-polluted examples, only `_cleaned` contributes and they're unchanged.
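To make the dedupe rule concrete, here's a simplified plain-Python sketch of the union behaviour (not Prodigy's actual implementation; the spans are hypothetical):

```python
# Spans from the two datasets for the same input text (same input hash).
cleaned_spans = [{"start": 0, "end": 4, "label": "PERSON"}]
relabeled_spans = [
    {"start": 0, "end": 4, "label": "PERSON"},      # duplicate, kept once
    {"start": 10, "end": 21, "label": "ECON_ORG"},  # the relabelled span
]

seen, merged = set(), []
for span in cleaned_spans + relabeled_spans:
    key = (span["start"], span["end"], span["label"])
    if key not in seen:
        seen.add(key)
        merged.append(span)

print(len(merged))  # → 2: PERSON once, ECON_ORG once
```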
Heads up: don't reach for prodigy db-merge here — it's plain concatenation and would put two copies of the same input into the merged dataset with conflicting spans. Combining at training time via train or data-to-spacy is the right path.
Finally, if you'd rather not do any extra scripting, you can re-annotate the polluted dataset and use review to reconcile the conflicts manually.
- re-annotate the source texts with ECON_ORG / ECONOMIST only

  ```
  prodigy ner.manual ner_econ_org_reannotated blank:en dataset:ner_economist_speeches2 --label ECON_ORG,ECONOMIST
  ```
- review: surfaces every example where the two datasets disagree on a span

  ```
  prodigy review ner_economist_speeches_rev ner_economist_speeches2,ner_econ_org_reannotated --view-id ner_manual
  ```
- train on the reviewed dataset

  ```
  prodigy train ./out --ner ner_economist_speeches_rev
  ```
prodigy review groups examples by input hash and shows the conflicting span versions side by side so the annotator can resolve them. The output is a single clean dataset you can train on directly.
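The grouping key is the input hash that Prodigy stores on each example; a rough plain-Python sketch of how disagreements get detected (made-up hashes and spans, not Prodigy's implementation):

```python
from collections import defaultdict

# Input 111 exists in both datasets with different labels (a conflict);
# input 222 appears only once, so there's nothing to reconcile.
examples = [
    {"_input_hash": 111, "spans": [{"start": 0, "end": 4, "label": "ECONOMIST"}]},
    {"_input_hash": 111, "spans": [{"start": 0, "end": 4, "label": "ECON_ORG"}]},
    {"_input_hash": 222, "spans": [{"start": 5, "end": 9, "label": "PERSON"}]},
]

groups = defaultdict(list)
for eg in examples:
    groups[eg["_input_hash"]].append(eg["spans"])

conflicts = {h for h, versions in groups.items()
             if len({str(v) for v in versions}) > 1}
print(conflicts)  # → {111}
```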
The previous suggestions are more time-efficient, but they require an extra scripting step to split and filter the dataset.