You can export your dataset by running the db-out
command and then check the JSONL file:
prodigy db-out resume_ner > resume_ner.jsonl
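If you first want to see which spans are actually problematic, you can scan the exported JSONL for entities with stray leading or trailing whitespace. This is just a sketch, not a built-in Prodigy feature — `flag_whitespace_spans` is a hypothetical helper, and it takes an iterable of lines so you can pass it an open file like `flag_whitespace_spans(open("resume_ner.jsonl"))`:

```python
import json

def flag_whitespace_spans(lines):
    """Collect (text, span) pairs whose span text has stray whitespace."""
    flagged = []
    for line in lines:
        eg = json.loads(line)
        for span in eg.get("spans", []):
            entity = eg["text"][span["start"]:span["end"]]
            if entity != entity.strip():
                flagged.append((eg["text"], span))
    return flagged

# Made-up record: the second span accidentally includes the leading space
rows = [json.dumps({"text": "Jane Doe ACME", "spans": [
    {"start": 0, "end": 8, "label": "PERSON"},  # "Jane Doe" – fine
    {"start": 8, "end": 13, "label": "ORG"},    # " ACME" – flagged
]})]
print(flag_whitespace_spans(rows))
```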
After you’ve removed the problematic spans or have corrected them, you can then reimport the data to a new dataset:
prodigy db-in resume_ner_fixed resume_ner.jsonl
You could probably also write a script that finds the problematic entities automatically, excludes them and adds the result to a new dataset. I haven't tested this yet, but something like the following should work:
```python
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("resume_ner")
fixed_examples = []

def is_whitespace_entity(text):
    # True if the entity text starts or ends with stray whitespace
    whitespace = (" ", "\n")  # etc.
    return text.startswith(whitespace) or text.endswith(whitespace)

for eg in examples:
    new_spans = []
    for span in eg.get("spans", []):
        # The slice of the text covered by the span
        entity = eg["text"][span["start"]:span["end"]]
        if not is_whitespace_entity(entity):
            new_spans.append(span)
    eg["spans"] = new_spans
    fixed_examples.append(eg)

db.add_dataset("resume_ner_fixed")
db.add_examples(fixed_examples, ["resume_ner_fixed"])
```
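If you want to sanity-check the filtering logic before touching the database, you can run the same idea on a hand-made task dict. The text and spans below are made up, and the `strip()`-based check is just an equivalent shortcut for the whitespace test (it also catches empty and all-whitespace spans):

```python
def is_whitespace_entity(text):
    # Flags spans that are empty, all whitespace, or have stray
    # leading/trailing whitespace
    return not text.strip() or text != text.strip()

# Made-up example mimicking Prodigy's task format
eg = {"text": "Alice  Smith", "spans": [
    {"start": 0, "end": 5},   # "Alice" – kept
    {"start": 5, "end": 7},   # "  " – pure whitespace, dropped
    {"start": 6, "end": 12},  # " Smith" – leading space, dropped
]}
eg["spans"] = [
    span for span in eg["spans"]
    if not is_whitespace_entity(eg["text"][span["start"]:span["end"]])
]
print(eg["spans"])  # only the "Alice" span survives
```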