Hi,
I have been searching this forum for ways to correct bad labels for NER with Jupyter and Prodigy.
Problem:
I annotated my text with the labels "PERSON" and "REF" (i.e. reference). The label "PERSON" is assigned to text such as 'Jan Jansen' and the label "REF" to text such as 'Jansen et al.'. But while annotating, I accidentally labeled some individual cases containing 'et al.' as "PERSON". Now I want to relabel those individual cases as "REF".
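To make that concrete, here is a simplified sketch of what one such mislabeled example looks like in my dataset (the text and character offsets are made up for illustration; the real records also carry tokens, hashes, _view_id and _timestamp fields):

```python
# Simplified, made-up example record: the span text contains 'et al.'
# but was accidentally given the label "PERSON" instead of "REF".
example = {
    "text": "Jansen et al. (2020) studied this before.",
    "spans": [
        {"start": 0, "end": 13, "label": "PERSON"},  # 'Jansen et al.' -- should be "REF"
    ],
    "answer": "accept",
}
```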
These are the possible solutions I have looked at:
Solution 1: as discussed in the post Corrections on an already annotated NER dataset:
```
python -m prodigy ner.manual new_ner_dataset blank:en dataset:ner_dataset --label PERSON,REF
```
This solution means going through all the examples individually in the Prodigy annotation interface (458 in my case), looking at the highlighted text wherever a wrong label was given, and correcting it to the right label.
Solution 2: as given in Renaming labels in NER - #6 by bob_ln.
It involves changing the label from "PERSON" to "REF" in the .jsonl file:
```
sed 's,"label":"OLD_LABEL","label":"NEW_LABEL",g' old_task.jsonl > new_task.jsonl
```
But I do not want a wholesale change of labels from "PERSON" to "REF"; I only want to change a few bad labels.
Solution 3: I looked into handling individual labels as shown by ryanwesslen in Export Annotated Data from ner.manual to get list of words per label - #6 by ryanwesslen. In a Jupyter notebook, I loaded the annotated dataset and went through its spans, selecting those with the label "PERSON". Within this selection I checked whether the span text contains the string 'et al', and if so, changed the label to "REF".
```python
from collections import defaultdict

from prodigy.components.db import connect

db = connect("sqlite", {"name": "prodigy.db"})
examples = db.get_dataset("sample03_dataset")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "PERSON":
                    if 'et al' in word:
                        label = "REF"
                    label_stats[label].append(word)
            except:
                continue

print(label_stats["REF"])
```
This converts the annotations that were previously under the label "PERSON" to the label "REF":
['Whiteside et al.', -----, 'Shigeoka et al.']
To check whether these annotations are added to the annotations already labeled "REF", I wrote the following check:
```python
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "REF":
                    label_stats[label].append(word)
            except:
                continue

print(label_stats["REF"])
```
The output confirms my hunch:
['Whiteside et al.', ----- 'Shigeoka et al.', 'Wang et al.', ------ 'Speicher et al.']
However, this is where my knowledge of Python and of working with JSONL files stops. I tried to convert the dataset to a pandas DataFrame and export it as a JSONL file:
```python
import json

import pandas as pd

df = pd.DataFrame(examples)
df.to_json("sample03.jsonl", orient="records", lines=True)
```
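As a quick sanity check on the export (assuming sample03.jsonl ends up in the working directory), reading back the first line of the file would look roughly like this:

```python
# Quick sanity check: read back the first line of the exported JSONL
# (assumes "sample03.jsonl" is in the current working directory).
with open("sample03.jsonl", encoding="utf8") as f:
    first = json.loads(f.readline())

print(first["text"][:60])
print(first.get("spans"))
```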
However, when I db-in the JSONL file into Prodigy and check the dataset with the above-mentioned method by ryanwesslen, I see that no change has been recorded in the dataset. Ergo, the changes I made in the Jupyter notebook are not reflected in the dataset.
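Concretely, the re-check on the re-imported data looks roughly like this (the dataset name "sample03_reimported" is only a placeholder for whatever name I used with db-in):

```python
# Hypothetical re-check after db-in; "sample03_reimported" is a placeholder dataset name.
reimported = db.get_dataset("sample03_reimported")

still_wrong = [
    eg["text"][span["start"]:span["end"]]
    for eg in reimported
    if eg["answer"] == "accept"
    for span in eg.get("spans", [])
    if span["label"] == "PERSON" and "et al" in eg["text"][span["start"]:span["end"]]
]
print(still_wrong)  # still shows the 'et al.' cases, i.e. my corrections were not saved
```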
Solution 4: I watched the YouTube video Finding BAD LABELS for TEXT CLASSIFICATION with Jupyter and Prodigy by Vincent. He searches for a word ('exciting') in the provided text and, if the match is indeed correct, labels the text as 'excitement' after deliberation. However, I do not know how to apply this procedure to entity labeling. He also works with a csv input file, while in my case I am working with JSONL. For reference, this is what my dataset looks like as a DataFrame:
||text|_input_hash|_task_hash|tokens|spans|_is_binary|_view_id|answer|_timestamp|
|---|---|---|---|---|---|---|---|---|---|
|0|Early career track My entire early career h...|-255810039|-317197356|[{'text': '', 'start': 0, 'end': 1, 'id': 0, ...|[{'token_start': 132, 'token_end': 133, 'start...|False|ner_manual|accept|1670394044|
|1|My research focusses on studying the molecula...|-1432622401|-20052039|[{'text': '', 'start': 0, 'end': 1, 'id': 0, ...|[{'start': 148, 'end': 180, 'text': 'the Nethe...|False|ner_manual|accept|1670394072|
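I guess that loading my JSONL into pandas, similar to how Vincent loads his csv, would be something like the sketch below, but I do not know how to get from there to correcting the entity spans and writing them back into a Prodigy dataset:

```python
import pandas as pd

# Assumption on my side: load the exported JSONL the way Vincent loads his csv.
df = pd.read_json("sample03.jsonl", orient="records", lines=True)
print(df[["text", "spans", "answer"]].head())
```

So my question remains: what is the right way to get these few targeted label corrections back into the Prodigy dataset?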