Hi @kushalrsharma!
It seems like you have labels that were misnamed SOFTSKILL and HARDSKILL in your annotations.
The simplest approach is to change those labels. Let's say your annotations are in a Prodigy dataset called ner_dataset. First, export it:
python -m prodigy db-out ner_dataset > ner_dataset.jsonl
Then run a Python script that changes any span labels from SOFTSKILL and HARDSKILL to SOFTSKILLS and HARDSKILLS. This isn't the most elegant, but it's a simple way to correct them:
import srsly

examples = srsly.read_jsonl("ner_dataset.jsonl")
updated_examples = []
for eg in examples:
    # Some examples may have no "spans" key, so fall back to an empty list
    for span in eg.get("spans", []):
        if span["label"] == "SOFTSKILL":
            span["label"] = "SOFTSKILLS"
        elif span["label"] == "HARDSKILL":
            span["label"] = "HARDSKILLS"
    updated_examples.append(eg)
srsly.write_jsonl("new_ner_dataset.jsonl", updated_examples)
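Before loading the file back in, it can be worth confirming that no misnamed labels are left. Here is a minimal sketch that counts span labels in JSONL lines using only the standard library. The helper name count_span_labels and the sample data are made up for illustration; the only assumption about the export is that each line is a JSON object with an optional "spans" list whose items have a "label" key.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count span labels across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # "spans" may be missing or None on examples without entities
        for span in eg.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# Tiny in-memory sample standing in for new_ner_dataset.jsonl (hypothetical data)
sample = [
    '{"text": "Good communicator, knows SQL", "spans": ['
    '{"start": 0, "end": 17, "label": "SOFTSKILLS"}, '
    '{"start": 25, "end": 28, "label": "HARDSKILLS"}]}',
    '{"text": "No entities here"}',
]
label_counts = count_span_labels(sample)
print(label_counts)
```

To run it on the real file, pass an open file handle instead of the sample list, e.g. count_span_labels(open("new_ner_dataset.jsonl", encoding="utf8")). If SOFTSKILL or HARDSKILL still show up in the counts, the fix script missed something.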
Then load the new .jsonl file into a new dataset with db-in:
python -m prodigy db-in new_ner_dataset new_ner_dataset.jsonl
That should work. Now you should be able to train:
python -m prodigy train --ner new_ner_dataset my_model --eval-split 0.2
What's more important, though, is to note that your labeling process has some gaps (e.g., allowing label misspellings) and to look for ways to close them. One way to prevent this in the future is to use a simple .txt file with your label names.
For example, in your main directory, have a file named labels.txt with:
ADDRESS
COMPANY
DATE
DEGREE
DOMAIN
EXPERIENCE
HARDSKILLS
PERSON
ROLE
SOFTSKILLS
UNIVERSITY
Now you can run:
python -m prodigy ner.manual ner_dataset blank:en my_input_data.jsonl --label labels.txt
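The same label list can also double as a check on existing annotations. As a rough sketch (the helper find_unknown_labels and the sample examples are hypothetical, not part of Prodigy), you could compare every span label against the allowed set and report anything unexpected:

```python
def find_unknown_labels(examples, allowed_labels):
    """Return a sorted list of span labels that are not in allowed_labels."""
    allowed = set(allowed_labels)
    unknown = set()
    for eg in examples:
        # "spans" may be missing or None on examples without entities
        for span in eg.get("spans") or []:
            if span["label"] not in allowed:
                unknown.add(span["label"])
    return sorted(unknown)


# Hypothetical stand-ins for the contents of labels.txt and an exported dataset
allowed_labels = ["HARDSKILLS", "SOFTSKILLS", "PERSON"]
examples = [
    {"text": "...", "spans": [{"label": "SOFTSKILL"}, {"label": "PERSON"}]},
    {"text": "...", "spans": [{"label": "HARDSKILLS"}]},
    {"text": "..."},
]
unknown = find_unknown_labels(examples, allowed_labels)
print(unknown)  # ['SOFTSKILL']
```

In practice you would build allowed_labels from the non-empty lines of labels.txt and examples from the db-out export (for instance via srsly.read_jsonl). An empty result means every annotated label matches the list.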
Relatedly, since I think you're likely using named multi-user sessions, be sure to set the PRODIGY_ALLOWED_SESSIONS environment variable. This prevents users from accidentally mistyping their session name, which can cause similar problems down the road.
For example, if PRODIGY_ALLOWED_SESSIONS=alex,jo, then only ?session=alex and ?session=jo would be allowed, and other names would raise an error.
Hope this helps!