Hi,
I have been searching this forum for ways to correct bad labels for NER with Jupyter and Prodigy.
Problem:
I annotated my text with the labels "PERSON" and "REF" (i.e. reference). The label "PERSON" is assigned to text such as 'Jan Jansen' and the label "REF" to text such as 'Jansen et al.'. But while annotating, I accidentally labeled some individual cases containing 'et al.' as "PERSON". Now I want to relabel those individual cases as "REF".
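To make that concrete, here is a simplified sketch of what one such mislabeled example looks like in my dataset (the text and character offsets are made up for illustration; the real records also carry tokens, hashes, _view_id and _timestamp fields):

```python
# Simplified, made-up example record: the span text contains 'et al.'
# but was accidentally given the label "PERSON" instead of "REF".
example = {
    "text": "Jansen et al. (2020) studied this before.",
    "spans": [
        {"start": 0, "end": 13, "label": "PERSON"},  # 'Jansen et al.' -- should be "REF"
    ],
    "answer": "accept",
}
```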
These are the possible solutions I have looked at:
Solution 1: as discussed in the post Corrections on an already annotated NER dataset:
```
python -m prodigy ner.manual new_ner_dataset blank:en dataset:ner_dataset --label PERSON,REF
```
This solution means going through all the examples individually in the Prodigy annotation interface (458 in my case), looking at the highlighted text wherever a wrong label was given, and correcting it to the right label.
Solution 2: as given in Renaming labels in NER - #6 by bob_ln.
It involves changing the label from "PERSON" to "REF" in the .jsonl file:
```
sed 's,"label":"OLD_LABEL","label":"NEW_LABEL",g' old_task.jsonl > new_task.jsonl
```
But I do not want a wholesale change of labels from "PERSON" to "REF"; I only want to change a few bad labels.
Solution 3: I looked into handling individual labels as shown by ryanwesslen in Export Annotated Data from ner.manual to get list of words per label - #6 by ryanwesslen. In a Jupyter notebook, I loaded the annotated dataset and went through its spans, selecting those with the label "PERSON". Within this selection I checked whether the span text contains the string 'et al', and if so, changed the label to "REF".
```python
from collections import defaultdict

from prodigy.components.db import connect

db = connect("sqlite", {"name": "prodigy.db"})
examples = db.get_dataset("sample03_dataset")

label_stats = defaultdict(list)
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "PERSON":
                    if 'et al' in word:
                        label = "REF"
                    label_stats[label].append(word)
            except:
                continue

print(label_stats["REF"])
```
This converts the annotations that were previously under the label "PERSON" to the label "REF":
['Whiteside et al.', -----, 'Shigeoka et al.']
To check whether these annotations are added to the annotations already labeled "REF", I wrote the following check:
```python
for eg in examples:
    if eg["answer"] == "accept":  # you probably want to exclude ignored/rejected answers?
        for span in eg.get("spans", []):
            try:
                word = eg["text"][span["start"]:span["end"]]  # slice of the text
                label = span["label"]
                if label == "REF":
                    label_stats[label].append(word)
            except:
                continue

print(label_stats["REF"])
```
The output confirms my hunch:
['Whiteside et al.', ----- 'Shigeoka et al.', 'Wang et al.', ------ 'Speicher et al.']
However, this is where my knowledge of Python and of working with JSONL files stops. I tried to convert the dataset to a pandas DataFrame and export it as a JSONL file:
```python
import json

import pandas as pd

df = pd.DataFrame(examples)
df.to_json("sample03.jsonl", orient="records", lines=True)
```
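As a quick sanity check on the export (assuming sample03.jsonl ends up in the working directory), reading back the first line of the file would look roughly like this:

```python
# Quick sanity check: read back the first line of the exported JSONL
# (assumes "sample03.jsonl" is in the current working directory).
with open("sample03.jsonl", encoding="utf8") as f:
    first = json.loads(f.readline())

print(first["text"][:60])
print(first.get("spans"))
```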
However, when I db-in the JSONL file into Prodigy and check the dataset with the above-mentioned method by ryanwesslen, I see that no change has been recorded in the dataset. Ergo, the changes I made in the Jupyter notebook are not reflected in the dataset.
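Concretely, the re-check on the re-imported data looks roughly like this (the dataset name "sample03_reimported" is only a placeholder for whatever name I used with db-in):

```python
# Hypothetical re-check after db-in; "sample03_reimported" is a placeholder dataset name.
reimported = db.get_dataset("sample03_reimported")

still_wrong = [
    eg["text"][span["start"]:span["end"]]
    for eg in reimported
    if eg["answer"] == "accept"
    for span in eg.get("spans", [])
    if span["label"] == "PERSON" and "et al" in eg["text"][span["start"]:span["end"]]
]
print(still_wrong)  # still shows the 'et al.' cases, i.e. my corrections were not saved
```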
Solution 4: I watched the YouTube video Finding BAD LABELS for TEXT CLASSIFICATION with Jupyter and Prodigy by Vincent. He searches for a word ('exciting') in the provided text and, if the match is indeed correct, labels the text as 'excitement' after deliberation. However, I do not know how to apply this procedure to entity labeling. He also works with a csv input file, while in my case I am working with JSONL. For reference, this is what my dataset looks like as a DataFrame:
||text|_input_hash|_task_hash|tokens|spans|_is_binary|_view_id|answer|_timestamp|
|---|---|---|---|---|---|---|---|---|---|
|0|Early career track My entire early career h...|-255810039|-317197356|[{'text': '', 'start': 0, 'end': 1, 'id': 0, ...|[{'token_start': 132, 'token_end': 133, 'start...|False|ner_manual|accept|1670394044|
|1|My research focusses on studying the molecula...|-1432622401|-20052039|[{'text': '', 'start': 0, 'end': 1, 'id': 0, ...|[{'start': 148, 'end': 180, 'text': 'the Nethe...|False|ner_manual|accept|1670394072|
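I guess that loading my JSONL into pandas, similar to how Vincent loads his csv, would be something like the sketch below, but I do not know how to get from there to correcting the entity spans and writing them back into a Prodigy dataset:

```python
import pandas as pd

# Assumption on my side: load the exported JSONL the way Vincent loads his csv.
df = pd.read_json("sample03.jsonl", orient="records", lines=True)
print(df[["text", "spans", "answer"]].head())
```

So my question remains: what is the right way to get these few targeted label corrections back into the Prodigy dataset?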