ner.batch-train after ner.manual results in error (ValueError: [E024])

You can export your dataset by running the db-out command and then check the JSONL file:

prodigy db-out resume_ner > resume_ner.jsonl
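To make checking the exported file easier, here's a minimal sketch that lists spans consisting only of whitespace, or starting or ending with it. It assumes each line of the JSONL file is a task with a "text" and an optional list of "spans" with "start" and "end" character offsets. The function name `find_whitespace_spans` is just made up for this example and isn't part of Prodigy.

```python
import json


def find_whitespace_spans(lines):
    """Return (line_number, label, entity_text) for every span that is
    whitespace-only or has leading/trailing whitespace.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the file
        eg = json.loads(line)
        text = eg.get("text", "")
        for span in eg.get("spans") or []:
            entity = text[span["start"]:span["end"]]
            # strip() changes the text only if it has whitespace at the edges
            if entity != entity.strip():
                problems.append((line_number, span.get("label"), entity))
    return problems


# Example usage with the exported file:
# with open("resume_ner.jsonl", encoding="utf8") as f:
#     for line_number, label, entity in find_whitespace_spans(f):
#         print(line_number, label, repr(entity))
```

Printing the entity with `repr` makes the stray spaces and newlines visible, so you can see which lines of the file need fixing.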

After you’ve removed or corrected the problematic spans, you can reimport the data into a new dataset:

prodigy db-in resume_ner_fixed resume_ner.jsonl 

You could probably also write a script that finds the problematic entities automatically, excludes them and adds the result to a new dataset. I haven’t tested this yet, but something like this should work:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("resume_ner")
fixed_examples = []

def is_whitespace_entity(text):
    # True if the entity is only whitespace, or starts or ends with
    # whitespace (spaces, tabs, newlines etc.)
    return text != text.strip()

for eg in examples:
    new_spans = []
    for span in eg.get("spans", []):
        entity = eg["text"][span["start"]:span["end"]]
        if not is_whitespace_entity(entity):
            new_spans.append(span)
    eg["spans"] = new_spans
    fixed_examples.append(eg)

db.add_dataset("resume_ner_fixed")
db.add_examples(fixed_examples, ["resume_ner_fixed"])
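The script above drops the affected spans entirely. If you'd rather correct them, here's an equally untested sketch that moves the character offsets inwards until the entity no longer starts or ends with whitespace. `trim_span` is a hypothetical helper, not a Prodigy function, and it only touches "start" and "end". If your spans also carry token indices such as "token_start" and "token_end", those aren't updated here and may need adjusting as well.

```python
def trim_span(text, span):
    """Return a copy of `span` with leading/trailing whitespace trimmed
    off its character offsets, or None if nothing is left."""
    start = span["start"]
    end = min(span["end"], len(text))  # guard against offsets past the text
    # move the start forward past any leading whitespace
    while start < end and text[start].isspace():
        start += 1
    # move the end backward past any trailing whitespace
    while end > start and text[end - 1].isspace():
        end -= 1
    if start >= end:
        return None  # the span was empty or whitespace only
    fixed = dict(span)  # copy so the original span isn't modified
    fixed["start"], fixed["end"] = start, end
    return fixed
```

To use it in the loop above, you'd call `trim_span(eg["text"], span)` for each span and append the result to `new_spans` whenever it isn't `None`, instead of checking `is_whitespace_entity`.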