Export Annotated Data from ner.manual to get list of words per label

rmeier · January 31, 2021, 8:49am

Hi Everybody,

Is it possible to export the data from a ner.manual annotation project so that I get every word annotated with a given label as a dictionary (key: label, value: list of words annotated with that label)?

I would like to check whether the words annotated in Prodigy follow our annotation guidelines for each label, to ensure high quality.

Does anybody know a way to get such a list?

Thanks for your help
BR
Ralf

ines · February 2, 2021, 2:45am

Hi! That definitely sounds like a good plan.

Prodigy lets you interact with the database and annotations from Python, so you can write any custom logic that goes over your annotations and compiles stats about them. You can find an example of the JSON format for named entities here: https://prodi.gy/docs/api-interfaces#ner_manual

As you can see, this has all the info you need: a list of annotated "spans", each containing the start and end character offsets of the annotated text and the associated label. So you could do something like this:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("name_of_your_dataset")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label = span["label"]
            label_stats[label].append(word)

print(label_stats)
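
To try the same logic without a database connection, here is a minimal sketch that runs on hand-written records in the ner.manual format. The helper name `words_per_label` and the sample records are assumptions made for illustration; they are not part of Prodigy.

```python
from collections import defaultdict


def words_per_label(examples):
    """Group the annotated text of accepted examples by span label."""
    label_stats = defaultdict(list)
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored answers
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label_stats[span["label"]].append(word)
    return dict(label_stats)


# Hand-written records standing in for db.get_dataset(...)
sample = [
    {
        "text": "First look at the new MacBook Pro.",
        "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
        "answer": "accept",
    },
    {
        "text": "Apple is based in Cupertino.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
        "answer": "accept",
    },
    {
        "text": "Ignore this one.",
        "spans": [{"start": 0, "end": 6, "label": "ORG"}],
        "answer": "reject",
    },
]

result = words_per_label(sample)
print(result)
```

With real data, the list returned by `db.get_dataset(...)` would be passed in place of `sample`.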

do12siwu · September 6, 2022, 8:39pm

[Screenshot 2022-09-06 223751]
Hi Ines,
somehow it doesn't work in JupyterLab.
Do you know why?
Kind regards

ryanwesslen · September 6, 2022, 8:46pm

hi @do12siwu!

That looks like Prodigy isn't set up correctly in JupyterLab.

A few questions to diagnose:

How did you install Prodigy, and did you install it in a virtual environment?
Can you run these commands in a plain Python shell rather than in a Jupyter notebook?

If you can run them in a Python shell but not in Jupyter, then I suspect the issue is that your Jupyter notebook isn't using your virtual environment. The easiest fix is to activate your virtual environment and then start your Jupyter server from it.

There are alternatives too, such as registering the environment as a Jupyter kernel, as this blog discusses:

python -m ipykernel install --user --name=myenv
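
To see which environment a notebook is actually using, a quick check along these lines can be run both in the Python shell and in a notebook cell, and the two results compared. This is only a sketch; the helper name `module_available` is made up for illustration.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None


# Path of the Python interpreter behind this shell or notebook kernel
print(sys.executable)
# False here would mean Prodigy is not installed in this environment
print(module_available("prodigy"))
```

If the interpreter path printed in the notebook differs from the one in the activated virtual environment, the notebook is running on a different kernel.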

Let us know if this works!

do12siwu · September 6, 2022, 8:55pm

[Screenshot 2022-09-06 225449]
Thanks, my fault.
But what does this mean? Why is the list empty?
Kind regards

ryanwesslen · September 7, 2022, 2:32pm

hi @do12siwu!

To help make things easier, can you avoid pasting in images and use the code feature to paste in your code? This makes it much easier for us to replicate.

Can you explain what you're trying to accomplish? I saw you changed for span in eg.get("spans", []): to for span in eg.get("REGEX", []):. What are you trying to do here?

The problem is that your examples (I assume koText is a set of annotations from ner.manual) don't have a "REGEX" key. The get method looks up the key you pass in and returns the default, here an empty list, when that key is missing, so the loop never runs and your list stays empty.

This is what an example annotation looks like for the ner.manual recipe:

import pprint
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ner_manual")
pprint.pprint(examples[0])
{'_input_hash': -136499144,
 '_is_binary': False,
 '_task_hash': -986839541,
 '_timestamp': 1662560235,
 '_view_id': 'ner_manual',
 'answer': 'accept',
 'spans': [{'end': 33,
            'label': 'PRODUCT',
            'start': 22,
            'token_end': 6,
            'token_start': 5}],
 'text': 'First look at the new MacBook Pro.',
 'tokens': [{'end': 5, 'id': 0, 'start': 0, 'text': 'First', 'ws': True},
            {'end': 10, 'id': 1, 'start': 6, 'text': 'look', 'ws': True},
            {'end': 13, 'id': 2, 'start': 11, 'text': 'at', 'ws': True},
            {'end': 17, 'id': 3, 'start': 14, 'text': 'the', 'ws': True},
            {'end': 21, 'id': 4, 'start': 18, 'text': 'new', 'ws': True},
            {'end': 29, 'id': 5, 'start': 22, 'text': 'MacBook', 'ws': True},
            {'end': 33, 'id': 6, 'start': 30, 'text': 'Pro', 'ws': False},
            {'end': 34, 'id': 7, 'start': 33, 'text': '.', 'ws': False}]}
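
The difference between the two lookups can be seen on a cut-down copy of the record above. This is a small sketch using a hand-written dictionary, not data read from Prodigy.

```python
eg = {
    "text": "First look at the new MacBook Pro.",
    "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
    "answer": "accept",
}

missing = eg.get("REGEX", [])  # no such key, so the default is returned
found = eg.get("spans", [])    # the key exists, so its value is returned

print(missing)     # []
print(len(found))  # 1
```

The label lives inside each span under the "label" key, so filtering by label has to happen inside the loop over "spans".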

do12siwu · September 7, 2022, 3:32pm

Hello ryanwesslen,

I'm sorry for asking this in such a long-winded and complicated way.
I am still a beginner as far as programming is concerned.
My dataset is called "koText". This is a dataset that has already been labeled by me.
What I was trying to do was extract all the parts of the text that were labeled as REGEX and save them in a list.

Kind regards

ryanwesslen · September 7, 2022, 3:48pm

hi @do12siwu!

No problem at all! We're all learning and you're doing great.

Oh, so your label is named REGEX?

If so, you can then use this code:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("koText")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label = span["label"]
            if label == "REGEX": # only keep those spans with "REGEX" labels
                label_stats[label].append(word)

print(label_stats["REGEX"])

This will collect all of your annotated spans with the label "REGEX" in a list. Does this solve your problem?

do12siwu · September 7, 2022, 3:57pm

Thank you for your patience.
Yes, the Label is named REGEX.
I get this error after typing your code.

KeyError Traceback (most recent call last)
Input In [12], in <cell line: 8>()
9 if eg["answer"] == "accept": # you probably want to exclude ignored/rejected answers?
10 for span in eg.get("spans", []):
---> 11 word = eg["text"][span["start"]:span["end"]] # slice of the text
12 label = span["label"]
13 if label == "REGEX": # only keep those spans with "REGEX" labels

KeyError: 'start'

ryanwesslen · September 7, 2022, 4:45pm

Are you sure you used the ner.manual recipe to create the annotations in your koText dataset?

Can you run this:

from prodigy.components.db import connect
db = connect()
examples = db.get_dataset("koText")
print(examples[0])

I want to get an example of what your data looks like.

do12siwu · September 7, 2022, 4:49pm

I used the rel.manual recipe.

After running your code I get this:

ryanwesslen · September 7, 2022, 6:20pm

It's hard to tell but the error suggests that one of your spans doesn't have a start key (hence why you got a KeyError). I can't tell which one it is and you'd need to dig in a bit more to diagnose.

The simple alternative is to wrap those lines with a try ... except like this:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("koText")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "REGEX":
                    label_stats[label].append(word)
            except KeyError:  # skip spans that lack a "start", "end" or "label" key
                continue

print(label_stats["REGEX"])
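
To dig into which spans are affected instead of silently skipping them, something like the following sketch could help. The helper name `find_incomplete_spans` and the sample records are made up for illustration; what the incomplete spans in the koText dataset actually look like is not known from this thread.

```python
REQUIRED_KEYS = ("start", "end", "label")


def find_incomplete_spans(examples, required=REQUIRED_KEYS):
    """Return (example index, span index, missing keys) for each span lacking a required key."""
    problems = []
    for i, eg in enumerate(examples):
        for j, span in enumerate(eg.get("spans", [])):
            missing = [key for key in required if key not in span]
            if missing:
                problems.append((i, j, missing))
    return problems


# Hand-written records; the second span deliberately lacks offsets
sample = [
    {"text": "abc 123", "spans": [{"start": 4, "end": 7, "label": "REGEX"}]},
    {"text": "def 456", "spans": [{"label": "REGEX"}]},
]

problems = find_incomplete_spans(sample)
print(problems)
```

Passing the list from `db.get_dataset("koText")` instead of `sample` would show which examples to inspect more closely.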

do12siwu · September 7, 2022, 6:23pm

Thank you, Ryan!
It works now!!!
