NER tag capitalization question

Hello - I'm running into a problem with some tags, and not sure what might be causing it!

I'm annotating some medical data, pulling out, among other things, biomarkers. Two tags I have are BIOMARKER_NAME and BIOMARKER_RESULT.

I was noticing the model seemed to be learning some things (like finding intensity scores, or other outcomes) very quickly, and not learning new biomarker names at all.

When I investigated the annotations with db-out, I saw that some of the tags had been saved as (lowercase) biomarker_name and biomarker_result. When I did something like ner.correct asking for the BIOMARKER_NAME or BIOMARKER_RESULT tags, none of the lowercased ones were being displayed.

Reviewing the commands to prodigy, I never used the lowercase label when doing any training. Any ideas what's going on?

hi @BenHolmes,

Thanks for your question and welcome to the Prodigy community :wave:

By "tags", I assume you mean label names.

Do you have the exact commands you ran during annotation?

For future purposes, please be sure to provide full Prodigy commands (and ideally Prodigy version too) as this can help debug.

It does sound like you used lowercase labels at some point during annotation, since the labels in your data are lowercase (i.e., when you ran prodigy ner.manual ..., you passed lowercase labels). How did you do training? Can you provide the full command (e.g., prodigy train vs spacy train)?

I'm suspecting you're running into this issue:

Probably your best bet now is to export your data with db-out, write a Python script to change the labels from lower case to upper case in your .jsonl, then reload your data with db-in. (For generic Python scripting like this, try ChatGPT; it's really great if you give it an example input and an example output.) Sorry for the hassle, but as mentioned in that previous post, it's best to always annotate with capital letters for --label.
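As a rough illustration of the kind of script described above, here is a minimal sketch. It assumes the exported file holds one JSON task per line and that each task keeps its NER labels under "spans" as dicts with a "label" key; the function names and file names are hypothetical, so check a few lines of your own db-out export before relying on it.

```python
import json


def uppercase_span_labels(task):
    """Return a copy of one task dict with every span label upper-cased.

    Assumes labels live in task["spans"][i]["label"]; tasks without
    spans are returned unchanged.
    """
    fixed = dict(task)
    if task.get("spans"):
        fixed["spans"] = [
            {**span, "label": span["label"].upper()} if "label" in span else dict(span)
            for span in task["spans"]
        ]
    return fixed


def convert_file(in_path, out_path):
    """Read a .jsonl export line by line and write a copy with upper-cased labels."""
    with open(in_path, encoding="utf8") as src, open(out_path, "w", encoding="utf8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            task = uppercase_span_labels(json.loads(line))
            dst.write(json.dumps(task, ensure_ascii=False) + "\n")
```

Calling convert_file("annotations.jsonl", "annotations_updated.jsonl") (both names are placeholders) would produce a file that can then be loaded into a fresh dataset with db-in. Writing to a new file and a new dataset keeps the original annotations untouched in case something goes wrong.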

Hope this helps!

Thanks for the assistance!

For a sample case, I did a db-out on the old model, calling that file annotations_updated.jsonl

and altered the tags in it to make sure all were upper-case (changed 'biomarker_name' to 'BIOMARKER_NAME', etc.).

I then ran

prodigy db-in biomarkers_standardized ./annotations_updated.jsonl

prodigy train ./prodigy_models --ner ner_biomarkers_standardized

at this point, there were 8 labels: BIOMARKER_NAME, BIOMARKER_RESULT, PERCENTAGE, ALLRED, PLATFORM, PDOT, CDOT, CLONE

I then ran

prodigy ner.correct ner_biomarkers_standardized [model location] [.jsonl location] --label BIOMARKER_NAME,BIOMARKER_RESULT,PERCENTAGE,ALLRED,PLATFORM,PDOT,CDOT,CLONE

(where [model location] and [.jsonl location] are the locations of the model-best folder and the .jsonl file containing the reports of interest)

a few times, annotating some reports

I then tried to run

prodigy ner.teach ner_biomarkers_standardized [model location] [.jsonl location] --label BIOMARKER_NAME,BIOMARKER_RESULT,PERCENTAGE,ALLRED,PLATFORM,PDOT,CDOT,CLONE

and was told

:information_source: Available labels in model en_pipeline: ['ALLRED', 'BIOMARKER_NAME',
'BIOMARKER_RESULT', 'CLONE', 'PERCENTAGE', 'PLATFORM', 'platform']
✘ Can't find label 'PDOT' in model en_pipeline

Now, that's fine, because I hadn't added any PDOT tags. HOWEVER, at some point the lower-case 'platform' tag was added.

I can see every command I ran (and I didn't type it in fresh every time, these were copied commands) on pycharm.

For sure none of them included 'platform'.

I'm using prodigy 1.14.5

Thanks for the background. That’s a bit odd but it does sound like at some point you may have accidentally run once with a lower case platform label. The problem is I can’t reproduce this so it’s hard for us to do much more.

Hopefully if you can modify any examples with the lower case label you should be okay moving forward.
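To find which examples still carry a lower case label, one option is to count the labels in a db-out export. The sketch below is only an illustration: it assumes one JSON task per line with labels stored under "spans", and the function names are made up for this example.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count how often each span label occurs across JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        for span in task.get("spans") or []:
            if "label" in span:
                counts[span["label"]] += 1
    return counts


def non_uppercase_labels(counts):
    """Return the labels that are not fully upper-case, sorted alphabetically."""
    return sorted(label for label in counts if label != label.upper())
```

Passing an open file handle for the exported .jsonl to count_span_labels and then calling non_uppercase_labels on the result should list any stray labels such as 'platform', along with how many spans use them.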

If you do see something weird, please provide a fully reproducible example (code and example data).

As I mentioned in the earlier linked post, you may want to put all of the label names in a labels.txt file and pass that file to --label, so you’re more confident you can avoid a typo when writing the commands each time. Sorry I can’t help more but let us know if you have further issues.
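If you go the labels.txt route, a small check like the following could catch a stray lower case or duplicated label before it reaches a command. This is a sketch under assumptions: it expects one label per line, and the helper name parse_labels is hypothetical rather than part of Prodigy.

```python
def parse_labels(lines):
    """Parse one label per line, rejecting lower-case or duplicate entries."""
    labels = []
    for line in lines:
        label = line.strip()
        if not label:
            continue  # ignore blank lines
        if label != label.upper():
            raise ValueError(f"Label is not upper-case: {label!r}")
        if label in labels:
            raise ValueError(f"Duplicate label: {label!r}")
        labels.append(label)
    return labels
```

For example, calling parse_labels on the lines of labels.txt returns the labels in file order, and ",".join(...) on that list gives the comma-separated form used in the commands earlier in this thread.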