Hard limit on consecutive tokens in NER annotations

I'm working with clients on getting them upskilled in annotation. They currently do their annotation work on a custom-built system, and I'm getting them set up with Prodigy to produce better-quality annotations going forward.

One main issue they have is over-annotating (e.g. annotating 10 words in a row to capture the full context, when the core thing they're annotating is only 3 words long).

Would it be possible to add a strict token-length constraint to annotations? I'd love to hear your thoughts on this.

Ha, this is probably the best-timed enhancement proposal because I just implemented something for this today :sweat_smile: So v1.10 should have you covered here, and you'll be able to provide a validate_answer callback that's called on every answer when the annotator hits "accept" or "reject" and lets you raise custom errors that are then shown as alerts in the UI.

So you could do something like this:

def validate_answer(eg):
    for span in eg.get("spans", []):
        # token_end is inclusive, so add 1 to get the span length in tokens
        span_len = span["token_end"] - span["token_start"] + 1
        # the character offsets let us show the offending text in the error
        span_text = eg["text"][span["start"]:span["end"]]
        if span_len > 10:
            raise ValueError(f"Selected span longer than 10 tokens: {span_text}")

You can either raise an error or use assert statements with a message. The user will see the verbatim text of the error message, so you can use that to provide more information or an explanation if needed.
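
For example, here's the same check written with an assert instead (a minimal variant of the snippet above, with the 10-token limit just being the example threshold):

def validate_answer(eg):
    for span in eg.get("spans", []):
        span_len = span["token_end"] - span["token_start"] + 1
        span_text = eg["text"][span["start"]:span["end"]]
        # the assertion message is what the annotator sees in the UI alert
        assert span_len <= 10, f"Selected span longer than 10 tokens: {span_text}"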

That's great to hear! And a super powerful generic feature. Thanks!! Always impressed :smile:


Just released Prodigy v1.10, which includes the validate_answer callback! See here for details and examples.
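
For reference, here's roughly how the callback plugs into a custom recipe (a minimal sketch: the recipe name, labels, and source file are hypothetical, and a manual NER stream is assumed):

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe("ner-limited")  # hypothetical recipe name
def ner_limited(dataset, source):
    # tokenize the incoming examples so the manual NER interface can use them
    nlp = spacy.blank("en")
    stream = add_tokens(nlp, JSONL(source))

    def validate_answer(eg):
        # same check as above: block answers with spans over 10 tokens
        for span in eg.get("spans", []):
            span_len = span["token_end"] - span["token_start"] + 1
            assert span_len <= 10, "Selected span longer than 10 tokens"

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": ["PERSON", "ORG"]},  # example labels
        "validate_answer": validate_answer,  # called on every accept/reject
    }

You'd then start the server with something like prodigy ner-limited my_dataset ./data.jsonl -F recipe.py.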