unannotated labels from meta.json file

kushalrsharma · February 8, 2023, 5:02pm

I trained a model using

!python -m prodigy train --ner test_resume 2023-02-08_test_model --eval-split 0.2

The dataset contains these labels:

"ADDRESS","COMPANY","DATE","DEGREE","DOMAIN","EXPERIENCE","HARDSKILLS","PERSON","ROLE","SOFTSKILLS","UNIVERSITY"

but after training, the labels in meta.json look like this:

"ner":[
   "ADDRESS",
   "COMPANY",
   "DATE",
   "DEGREE",
   "DOMAIN",
   "EXPERIENCE",
   "HARDSKILL",
   "HARDSKILLS",
   "PERSON",
   "ROLE",
   "SOFTSKILL",
   "SOFTSKILLS",
   "UNIVERSITY"
 ]

How can I eliminate SOFTSKILL and HARDSKILL?

The annotated JSONL file does not contain SOFTSKILL or HARDSKILL.

ryanwesslen · February 8, 2023, 6:12pm

hi @kushalrsharma!

It seems like you have labels that were misnamed SOFTSKILL and HARDSKILL in your annotations.

The simplest approach is to rename those labels. Let's say your annotations are in a Prodigy dataset called ner_dataset. First, export it:

python -m prodigy db-out ner_dataset > ner_dataset.jsonl

Then run a Python script that renames any span labeled SOFTSKILL or HARDSKILL to SOFTSKILLS or HARDSKILLS. This isn't the most elegant solution, but it's a simple way to correct them:

import srsly

# Map each misspelled label to its correct name.
RENAMES = {"SOFTSKILL": "SOFTSKILLS", "HARDSKILL": "HARDSKILLS"}

examples = srsly.read_jsonl("ner_dataset.jsonl")

updated_examples = []
for eg in examples:
    # "spans" can be missing or None on examples without annotations.
    for span in eg.get("spans") or []:
        span["label"] = RENAMES.get(span["label"], span["label"])
    updated_examples.append(eg)

srsly.write_jsonl("new_ner_dataset.jsonl", updated_examples)
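Before reloading, it may be worth confirming which labels the rewritten file actually contains. The following is a minimal sketch using only the standard library; the helper name count_span_labels is hypothetical, and it assumes each JSONL record stores its annotations under a "spans" key with a "label" field, as in the script above.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count span labels across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # "spans" may be missing or null on examples without annotations
        for span in eg.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# Usage (assumes new_ner_dataset.jsonl is in the working directory):
# with open("new_ner_dataset.jsonl", encoding="utf8") as f:
#     for label, n in sorted(count_span_labels(f).items()):
#         print(label, n)
```

If SOFTSKILL or HARDSKILL still shows up in the counts, the rename did not catch every span. Running the same check on the original export should also show whether the misspelled labels were in the annotations to begin with.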

Then reload that new .jsonl with db-in:

python -m prodigy db-in new_ner_dataset new_ner_dataset.jsonl

That should work. Now you should be able to train:

python -m prodigy train --ner new_ner_dataset my_model --eval-split 0.2

What's more important, though, is to note that your labeling process has some gaps (e.g., it allows label misspellings) and to look for ways to close them. One way to prevent this in the future is to keep your label names in a simple .txt file.

For example, in your main directory, have a file named labels.txt with:

ADDRESS
COMPANY
DATE
DEGREE
DOMAIN
EXPERIENCE
HARDSKILLS
PERSON
ROLE
SOFTSKILLS
UNIVERSITY

Now you can run:

python -m prodigy ner.manual ner_dataset blank:en my_input_data.jsonl --label labels.txt
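The same labels file can also serve as a check on data that has already been annotated. The sketch below is an illustration rather than a Prodigy feature: the helper names read_labels and unexpected_labels are hypothetical, and it assumes examples are dictionaries with an optional "spans" list, as in an exported dataset.

```python
def read_labels(text):
    """Parse the contents of a labels file: one label per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def unexpected_labels(examples, allowed):
    """Return the set of span labels that are not in the allowed set."""
    found = set()
    for eg in examples:
        # "spans" may be missing or None on examples without annotations
        for span in eg.get("spans") or []:
            if span["label"] not in allowed:
                found.add(span["label"])
    return found


# Usage (assumes labels.txt exists and `examples` was loaded from an export):
# with open("labels.txt", encoding="utf8") as f:
#     allowed = read_labels(f.read())
# print(unexpected_labels(examples, allowed))
```

An empty result means every span label in the dataset appears in labels.txt. Anything else is a candidate for renaming before training.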

Relatedly, since I think you're likely using named multi-user sessions, be sure to set the PRODIGY_ALLOWED_SESSIONS environment variable. This prevents users from accidentally mistyping their session name, which can cause similar problems down the road.

For example, if PRODIGY_ALLOWED_SESSIONS=alex,jo, then only ?session=alex and ?session=jo would be allowed and other names would raise an error.

Hope this helps!
