Export Annotated Data from ner.manual to get list of words per label

Hi Everybody,

Is it possible to export the data from a ner.manual annotation project so that I get every word that was annotated with a given label as a dictionary (key: label, value: list of annotated words for this label)?

I would like to check whether the words that were annotated using Prodigy follow our annotation guideline for each label, to ensure high quality.

Does anybody know a way to get such a list?

Thanks for your help
BR
Ralf

Hi! That definitely sounds like a good plan :+1:

Prodigy lets you interact with the database and annotations from Python, so you can write any custom logic that goes over your annotations and compiles stats about them. You can find an example of the JSON format for named entities here: https://prodi.gy/docs/api-interfaces#ner_manual

As you can see, this has all the info you need: a list of annotated "spans" containing the start and end character offset of the annotated word, and the associated label. So you could do something like this:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("name_of_your_dataset")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label = span["label"]
            label_stats[label].append(word)

print(label_stats)
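
For checking the results against an annotation guideline, it can help to count how often each string was annotated per label instead of keeping a flat list. The following is a minimal sketch, assuming examples in the ner.manual format described above. The helper name `words_per_label` and the sample data are hypothetical; in practice `examples` would come from `db.get_dataset(...)`.

```python
from collections import Counter, defaultdict


def words_per_label(examples):
    """Count each annotated string per label across accepted examples."""
    counts = defaultdict(Counter)
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip ignored/rejected answers
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]
            counts[span["label"]][word] += 1
    return counts


# Hypothetical, hand-written sample data (not from a real dataset)
sample = [
    {"text": "First look at the new MacBook Pro.", "answer": "accept",
     "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}]},
    {"text": "The MacBook Pro is out.", "answer": "accept",
     "spans": [{"start": 4, "end": 15, "label": "PRODUCT"}]},
    {"text": "Nothing to see here.", "answer": "reject", "spans": []},
]

counts = words_per_label(sample)
for label, counter in counts.items():
    # most_common() lists the annotated strings from most to least frequent
    print(label, counter.most_common())
```

Sorting by frequency makes one-off annotations easy to spot, and those are often the ones worth comparing against the guideline first.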

Screenshot 2022-09-06 223751
Hi Ines,
somehow it doesn't work in JupyterLab.
Do you know why?
Kind regards

hi @do12siwu!

That looks like Prodigy isn't set up correctly in JupyterLab.

A few questions to diagnose:

  • How did you install prodigy and did you install it in a virtual environment?
  • Can you run these commands in a plain Python shell rather than in a Jupyter notebook?

If you can run these in a Python shell but not in Jupyter, then I suspect the issue is that your Jupyter notebook isn't using your virtual environment. The easiest fix is to activate your virtual environment first and then start your Jupyter server.

But there are alternatives, as this blog discusses, such as registering your environment as a Jupyter kernel:

python -m ipykernel install --user --name=myenv
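
To see which environment a notebook is actually using, a quick check like the sketch below can be run both in the notebook and in the plain Python shell. If the two print different interpreter paths, the notebook is probably not running in the environment where Prodigy was installed. It uses only the standard library.

```python
import importlib.util
import sys

# Path of the Python interpreter running this notebook or shell
print(sys.executable)

# find_spec returns None when the package cannot be found in this environment
spec = importlib.util.find_spec("prodigy")
print("prodigy importable:", spec is not None)
```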

Let us know if this works!

Screenshot 2022-09-06 225449
Thanks, my fault.
But what does this mean? Why is the list empty?
Kind regards

hi @do12siwu!

To help make things easier, can you avoid pasting in images and use the code feature to paste in your code? This makes it much easier for us to replicate.

Can you explain what you're trying to accomplish? I saw you changed for span in eg.get("spans", []): to for span in eg.get("REGEX", []):. What are you trying to do here?

The problem is that your examples (I assume koText is a set of annotations from ner.manual) don't have a "REGEX" key. The get method looks up the value stored under the given key name and returns the default, here an empty list, if that key doesn't exist. That's why your list is empty.
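
A small illustration of this lookup behavior, using a trimmed-down, hand-written example dict (not real koText data):

```python
# Hypothetical, trimmed-down annotation example
eg = {
    "text": "First look at the new MacBook Pro.",
    "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
}

# "REGEX" is not a top-level key, so get() falls back to the default
missing = eg.get("REGEX", [])
print(missing)  # []

# The annotations live under the "spans" key; the label is a field of each span
labels = [span["label"] for span in eg.get("spans", [])]
print(labels)  # ['PRODUCT']
```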

This is what an example annotation looks like for the ner.manual recipe:

from prodigy.components.db import connect
db = connect()
examples = db.get_dataset("ner_manual")
import pprint
pprint.pprint(examples[0])
{'_input_hash': -136499144,
 '_is_binary': False,
 '_task_hash': -986839541,
 '_timestamp': 1662560235,
 '_view_id': 'ner_manual',
 'answer': 'accept',
 'spans': [{'end': 33,
            'label': 'PRODUCT',
            'start': 22,
            'token_end': 6,
            'token_start': 5}],
 'text': 'First look at the new MacBook Pro.',
 'tokens': [{'end': 5, 'id': 0, 'start': 0, 'text': 'First', 'ws': True},
            {'end': 10, 'id': 1, 'start': 6, 'text': 'look', 'ws': True},
            {'end': 13, 'id': 2, 'start': 11, 'text': 'at', 'ws': True},
            {'end': 17, 'id': 3, 'start': 14, 'text': 'the', 'ws': True},
            {'end': 21, 'id': 4, 'start': 18, 'text': 'new', 'ws': True},
            {'end': 29, 'id': 5, 'start': 22, 'text': 'MacBook', 'ws': True},
            {'end': 33, 'id': 6, 'start': 30, 'text': 'Pro', 'ws': False},
            {'end': 34, 'id': 7, 'start': 33, 'text': '.', 'ws': False}]}
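
As a sanity check on the example above, the character offsets and the token indices of a span should describe the same piece of text. The sketch below copies the values from that example by hand and assumes that `token_end` is inclusive, which is what the example data suggests.

```python
text = "First look at the new MacBook Pro."
span = {"start": 22, "end": 33, "token_start": 5, "token_end": 6,
        "label": "PRODUCT"}
tokens = [
    {"id": 0, "text": "First", "ws": True},
    {"id": 1, "text": "look", "ws": True},
    {"id": 2, "text": "at", "ws": True},
    {"id": 3, "text": "the", "ws": True},
    {"id": 4, "text": "new", "ws": True},
    {"id": 5, "text": "MacBook", "ws": True},
    {"id": 6, "text": "Pro", "ws": False},
    {"id": 7, "text": ".", "ws": False},
]

# Variant 1: slice the raw text with the character offsets
from_chars = text[span["start"]:span["end"]]

# Variant 2: rebuild the text from the tokens (token_end assumed inclusive);
# "ws" says whether a token is followed by whitespace
span_tokens = tokens[span["token_start"]:span["token_end"] + 1]
from_tokens = "".join(
    t["text"] + (" " if t["ws"] else "") for t in span_tokens
).strip()

print(from_chars)   # MacBook Pro
print(from_tokens)  # MacBook Pro
```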

Hello ryanwesslen,

I'm sorry for asking this in such a long-winded and complicated way.
I am still a beginner as far as programming is concerned.
My dataset is called "koText". This is a dataset that has already been labeled by me.
What I was trying to do was to extract all the parts of the text that were labeled as REGEX and save them in a list.

Kind regards

hi @do12siwu!

No problem at all! We're all learning and you're doing great :slight_smile:

Oh, so your label is named REGEX?

If so, you can then use this code:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("koText")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            word = eg["text"][span["start"]:span["end"]]  # slice of the text
            label = span["label"]
            if label == "REGEX": # only keep those spans with "REGEX" labels
                label_stats[label].append(word)

print(label_stats["REGEX"])

This will put into a list all of your annotated spans with the label "REGEX". Does this solve your problem?

Thank you for your patience.
Yes, the Label is named REGEX.
I get this error after typing your code.


KeyError                                  Traceback (most recent call last)
Input In [12], in <cell line: 8>()
      9     if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
     10         for span in eg.get("spans", []):
---> 11             word = eg["text"][span["start"]:span["end"]]  # slice of the text
     12             label = span["label"]
     13             if label == "REGEX":  # only keep those spans with "REGEX" labels

KeyError: 'start'

Are you sure you used the ner.manual recipe to get the annotations for your koText dataset?

Can you run this:

from prodigy.components.db import connect
db = connect()
examples = db.get_dataset("koText")
print(examples[0])

I want to get an example of what your data looks like.

I used the rel.manual recipe.

After running your code I get this:

It's hard to tell but the error suggests that one of your spans doesn't have a start key (hence why you got a KeyError). I can't tell which one it is and you'd need to dig in a bit more to diagnose.
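
One way to dig in is to list every span that lacks one of the keys the loop relies on. This is only a sketch: the helper name `find_incomplete_spans` and the sample data are hypothetical, and with real data you would pass in the result of `db.get_dataset("koText")` instead.

```python
REQUIRED_KEYS = ("start", "end", "label")


def find_incomplete_spans(examples):
    """Return (example index, missing keys, span) for each incomplete span."""
    problems = []
    for index, eg in enumerate(examples):
        for span in eg.get("spans", []):
            missing = [key for key in REQUIRED_KEYS if key not in span]
            if missing:
                problems.append((index, missing, span))
    return problems


# Hypothetical sample: the second example has a span without offsets
sample = [
    {"text": "abc def", "spans": [{"start": 0, "end": 3, "label": "REGEX"}]},
    {"text": "ghi", "spans": [{"label": "REGEX"}]},
]

for index, missing, span in find_incomplete_spans(sample):
    print(index, missing, span)
```

The printed index tells you which example to inspect, for instance with `print(examples[index])`.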

The simple alternative is to wrap those lines with a try ... except like this:

from prodigy.components.db import connect
from collections import defaultdict

db = connect()
examples = db.get_dataset("koText")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "REGEX":
                    label_stats[label].append(word)
            except KeyError:  # skip spans that are missing a "start", "end" or "label" key
                continue

print(label_stats["REGEX"])

Thank you, Ryan!
It works now!!!