Export Annotated Data from ner.manual to get list of words per label

Hi Everybody,

Is it possible to export the data from a ner.manual annotation project so that I get every word that was annotated with a given label as a dictionary (key: label, value: list of words annotated with this label)?

I would like to check whether the words annotated with Prodigy follow our annotation guideline for each label, to ensure high quality.

Does anybody know a way to get such a list?

Thanks for your help
BR
Ralf

Hi! That definitely sounds like a good plan :+1:

Prodigy lets you interact with the database and annotations from Python, so you can write any custom logic that goes over your annotations and compiles stats about them. You can find an example of the JSON format for named entities here: https://prodi.gy/docs/api-interfaces#ner_manual

As you can see, this has all the info you need: a list of annotated "spans", each containing the start and end character offsets of the annotated word and the associated label. So you could do something like this:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("name_of_your_dataset")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label = span["label"]
            label_stats[label].append(word)

print(label_stats)
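
For the guideline check, it may also help to count how often each word was annotated per label, so that one-off outliers stand out. Here is a minimal sketch of that idea, assuming examples in the ner_manual JSON format from the docs linked above. The helper name `words_per_label` and the sample data are made up for illustration, so it runs without a Prodigy database:

```python
from collections import Counter, defaultdict


def words_per_label(examples):
    """Map each label to a Counter of the annotated text slices (hypothetical helper)."""
    stats = defaultdict(Counter)
    for eg in examples:
        # Only count accepted examples; skip ignored/rejected ones
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            stats[span["label"]][word] += 1
    return stats


# Made-up sample data in the same shape as ner_manual annotations
sample = [
    {"text": "Apple hired Tim Cook", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 12, "end": 20, "label": "PERSON"}]},
    {"text": "Apple is in Cupertino", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 12, "end": 21, "label": "GPE"}]},
    {"text": "Berlin", "answer": "reject",
     "spans": [{"start": 0, "end": 6, "label": "GPE"}]},
]

stats = words_per_label(sample)
for label, counts in sorted(stats.items()):
    # Most frequent words first, so rare (possibly wrong) annotations end up last
    print(label, counts.most_common())
```

To run this on real annotations, you would pass in the `examples` list loaded from your dataset instead of `sample`.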