check ner annotation distribution

Zim1-finest · December 22, 2021, 9:15am

Hi,

I annotated about 5000 tweets with 2 named entities. Is it possible to check how many of the 5000 tweets contain only entity 1, only entity 2, both entities, or no entities at all?

ines · December 22, 2021, 11:29am

Hi! I think in this case, it's probably easiest to write your own script that compiles these counts based on your data. You can connect to the database in Python and load your annotations as a list of dictionaries. You could then do something like this:

from collections import Counter

from prodigy.components.db import connect

label1 = "LABEL1"
label2 = "LABEL2"

db = connect()  # connect to the configured Prodigy database
examples = db.get_dataset("your_dataset")  # annotations as a list of dicts
counts = Counter()

for eg in examples:
    # Collect the distinct labels of all spans annotated on this example
    labels = {span["label"] for span in eg.get("spans", [])}
    if not labels:
        counts["no_ents"] += 1
    elif label1 in labels and label2 in labels:
        counts["both"] += 1
    elif label1 in labels:
        counts[label1] += 1
    elif label2 in labels:
        counts[label2] += 1
print(counts)

The above code just counts whether an example contains only label 1 or 2, or both, or none. You could also compile more fine-grained stats, e.g. how many examples contain no entities, 1 entity span, 2, and so on. This really depends on the stats you're most interested in.
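
For the more fine-grained version, here's a minimal sketch that counts how many examples have 0, 1, 2, ... entity spans. The helper name `span_count_distribution` and the sample data are just made up for illustration. In practice you'd pass in the list returned by `db.get_dataset`, assuming each example stores its entities under `"spans"` as in the snippet above:

```python
from collections import Counter


def span_count_distribution(examples):
    """Map number of entity spans -> number of examples with that many spans."""
    # Treat a missing or empty "spans" entry as zero spans
    return Counter(len(eg.get("spans") or []) for eg in examples)


# Hypothetical stand-in for the list loaded from the database
examples = [
    {"text": "a", "spans": [{"label": "LABEL1"}]},
    {"text": "b", "spans": [{"label": "LABEL1"}, {"label": "LABEL2"}]},
    {"text": "c"},
    {"text": "d", "spans": []},
]

distribution = span_count_distribution(examples)
for n_spans, n_examples in sorted(distribution.items()):
    print(f"{n_spans} span(s): {n_examples} example(s)")
```

If your dataset also contains rejected or ignored examples, you may want to filter on the example's `"answer"` value first, so the counts only reflect annotations you accepted.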

Zim1-finest · December 22, 2021, 6:11pm

oh great,

works like a charm.

Thank you so much...
