Check NER annotation distribution


I annotated about 5000 tweets with 2 named entities. Is it possible to check how many of the 5000 tweets have entity 1, entity 2, no entities and both entity 1 and entity 2?

Hi! I think in this case, it's probably easiest to write your own script that compiles these counts based on your data. You can connect to the database in Python and load your annotations as a list of dictionaries. You could then do something like this:

from prodigy.components.db import connect
from collections import Counter

label1 = "LABEL1"
label2 = "LABEL2"

db = connect()
examples = db.get_dataset("your_dataset")
counts = Counter()

for eg in examples:
    # Collect the distinct labels of all spans annotated on this example
    labels = {span["label"] for span in eg.get("spans", [])}
    if not labels:
        counts["no_ents"] += 1
    elif label1 in labels and label2 in labels:
        counts["both"] += 1
    elif label1 in labels:
        # Only label 1 is present
        counts[label1] += 1
    elif label2 in labels:
        # Only label 2 is present
        counts[label2] += 1

print(counts)

The above code just counts whether an example contains only label 1 or 2, or both, or none. You could also compile more fine-grained stats, e.g. how many examples contain no entities, 1 entity span, 2, and so on. This really depends on the stats you're most interested in.
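
For the more fine-grained version, here is a minimal sketch that counts how many examples have 0, 1, 2, ... entity spans. It assumes the examples are the same kind of list of dictionaries as above; the helper name `span_count_distribution` and the sample data are made up for illustration, so swap in your own dataset.

```python
from collections import Counter


def span_count_distribution(examples):
    """Map 'number of spans in an example' to 'number of such examples'."""
    # Treat a missing or empty "spans" key as zero spans
    return Counter(len(eg.get("spans") or []) for eg in examples)


# Hypothetical stand-in for the list loaded from the database
sample = [
    {"text": "a", "spans": []},
    {"text": "b", "spans": [{"label": "LABEL1"}]},
    {"text": "c", "spans": [{"label": "LABEL1"}, {"label": "LABEL2"}]},
    {"text": "d"},  # no "spans" key at all
]

dist = span_count_distribution(sample)
for n_spans, n_examples in sorted(dist.items()):
    print(f"{n_spans} span(s): {n_examples} example(s)")
```

Note that this counts spans, not distinct labels, so a tweet with two mentions of the same entity type lands in the "2 spans" bucket.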

oh great,

works like a charm.

Thank you so much...
