unannotated labels from meta.json file

I trained a model using

!python -m prodigy train --ner test_resume 2023-02-08_test_model --eval-split 0.2

The dataset contains these labels:

"ADDRESS","COMPANY","DATE","DEGREE","DOMAIN","EXPERIENCE","HARDSKILLS","PERSON","ROLE","SOFTSKILLS","UNIVERSITY"

But after training, the labels in meta.json look like this:

"ner":[
   "ADDRESS",
   "COMPANY",
   "DATE",
   "DEGREE",
   "DOMAIN",
   "EXPERIENCE",
   "HARDSKILL",
   "HARDSKILLS",
   "PERSON",
   "ROLE",
   "SOFTSKILL",
   "SOFTSKILLS",
   "UNIVERSITY"
 ]

How can I eliminate SOFTSKILL and HARDSKILL?

The annotated JSONL file does not contain SOFTSKILL or HARDSKILL.

hi @kushalrsharma!

It seems like you have labels that were misnamed SOFTSKILL and HARDSKILL in your annotations.

The simplest approach is to change those labels. Let's say your annotations are in a Prodigy dataset called ner_dataset. First, export it with db-out:

python -m prodigy db-out ner_dataset > ner_dataset.jsonl
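Before changing anything, it may be worth confirming which labels the export actually contains. The snippet below is a minimal sketch, assuming the export is a JSONL file named ner_dataset.jsonl in which each example stores its entities under "spans" with a "label" key. The helper name count_span_labels is made up for illustration and is not part of Prodigy.

```python
import json
from collections import Counter


def count_span_labels(path):
    """Count how often each span label occurs in a JSONL export."""
    counts = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            eg = json.loads(line)
            # "spans" may be missing or None for examples without entities
            for span in eg.get("spans") or []:
                counts[span["label"]] += 1
    return counts
```

Calling count_span_labels("ner_dataset.jsonl") and printing the result should show whether SOFTSKILL or HARDSKILL appear at all, and in how many spans.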

Then run a Python script that changes any span labels from SOFTSKILL and HARDSKILL to SOFTSKILLS and HARDSKILLS. This isn't the most elegant approach, but it's a simple way to correct them:

import srsly

# Map each misnamed label to its intended name
LABEL_FIXES = {"SOFTSKILL": "SOFTSKILLS", "HARDSKILL": "HARDSKILLS"}

examples = srsly.read_jsonl("ner_dataset.jsonl")

updated_examples = []
for eg in examples:
    # "spans" may be missing or None for examples without entities
    for span in eg.get("spans") or []:
        span["label"] = LABEL_FIXES.get(span["label"], span["label"])
    updated_examples.append(eg)

srsly.write_jsonl("new_ner_dataset.jsonl", updated_examples)

Then reload that new .jsonl with db-in:

python -m prodigy db-in new_ner_dataset new_ner_dataset.jsonl

That should work. Now you should be able to train:

python -m prodigy train --ner new_ner_dataset my_model --eval-split 0.2

What's more important, though, is to take note that your labeling process has some gaps (e.g., it allows label misspellings) and to find ways to improve it. One way to prevent this in the future is to keep your label names in a simple .txt file.

For example, in your main directory, have a file named labels.txt with:

ADDRESS
COMPANY
DATE
DEGREE
DOMAIN
EXPERIENCE
HARDSKILLS
PERSON
ROLE
SOFTSKILLS
UNIVERSITY

Now you can run:

python -m prodigy ner.manual ner_dataset blank:en my_input_data.jsonl --label labels.txt
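The same labels.txt can also serve as a check on data you have already collected. Here is a rough sketch, assuming one label per line in labels.txt and a db-out export in JSONL format. The function names load_labels and find_unknown_labels are hypothetical helpers, not Prodigy APIs.

```python
import json


def load_labels(path):
    """Read one label per line, ignoring blank lines."""
    with open(path, encoding="utf8") as f:
        return {line.strip() for line in f if line.strip()}


def find_unknown_labels(jsonl_path, allowed):
    """Return the set of span labels in the export that are not in `allowed`."""
    unknown = set()
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            eg = json.loads(line)
            # "spans" may be missing or None for examples without entities
            for span in eg.get("spans") or []:
                if span["label"] not in allowed:
                    unknown.add(span["label"])
    return unknown
```

For instance, find_unknown_labels("ner_dataset.jsonl", load_labels("labels.txt")) should return an empty set once every span uses a label from the file. Anything it does return is a candidate misspelling to fix before training.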

On a related note, since I think you're likely using named multi-user sessions, be sure to set the PRODIGY_ALLOWED_SESSIONS environment variable. This prevents users from accidentally mistyping their session name, which can cause similar problems down the road.

For example, if PRODIGY_ALLOWED_SESSIONS=alex,jo, then only ?session=alex and ?session=jo would be allowed and other names would raise an error.

Hope this helps!