Losing spancat labels when training after using prodigy db-merge

Hello there. These are great applications, thank you for all of the effort and support.

I'm creating spancats on multiple types of documents, but looking for the same labels across them. For both ease of annotation and to see if it would make much of a difference, I've trained each one separately. When I evaluate each one independently, I'm getting good R, P, and F scores, but when I merge them, I'm losing entire labels. What's really odd is that both of the datasets have annotations for "PATIENT" and "DOB" -- if I were losing labels that only appear in one (e.g. ENCDATE) I would, maybe, understand. I have 27 different "databases" -- but to simplify what I'm seeing, I've reproduced it with 2.

My evaluation of the first "allergy" documents:

My evaluation of the second "audiology" documents:

The command I'm using to merge the data:
prodigy db-merge FAX_ALLERGY,FAX_AUDIOLOGY FAX_MERGED_AUDIOLOGY_ALLERGY

Q: Does the order matter?
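A quick way to sanity-check the merged dataset before converting is to count span labels straight from the Prodigy database. This is just a rough sketch -- it assumes the default database location, the merged dataset name above, and that accepted examples carry their annotations under "spans":

from collections import Counter
from prodigy.components.db import connect

# Connect to the default Prodigy database and load the merged dataset
db = connect()
examples = db.get_dataset("FAX_MERGED_AUDIOLOGY_ALLERGY")

# Count how many accepted spans carry each label
label_counts = Counter(
    span["label"]
    for eg in examples
    if eg.get("answer") == "accept"
    for span in eg.get("spans", [])
)
print(label_counts)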

I'm then converting to spacy:
python -m prodigy data-to-spacy spacy --spancat FAX_MERGED_AUDIOLOGY_ALLERGY --eval-split 0.2 --verbose
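data-to-spacy also writes out a labels file that the generated config points at, so it's worth a quick peek to confirm every label made it in. A small sketch, assuming the spacy/labels/spancat.json path that appears in the generated config:

import json

# Inspect the label list data-to-spacy exported for the spancat component
with open("spacy/labels/spancat.json") as f:
    print(json.load(f))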

Then training:
python -m spacy train spacy/config.cfg --output ./model --paths.train spacy/train.spacy --paths.dev spacy/dev.spacy --gpu-id 0 --verbose

Then evaluating:
python -m spacy evaluate ./model/model-best ./spacy/dev.spacy --spans-key PATIENT,DOB,MRN,SSN,INSPOLN,INSGRPN,EMAIL,PHONE,ENCDATE,CLAIMN -o ./evaluation.json

The output of evaluation.json on the combined model:

I'm losing labels -- what am I doing incorrectly (or simply don't understand)?

Note: When I combine all 27 databases, I lose fewer labels, but I'm still losing some, which is less than ideal.

Thanks in advance! I'm sure this is PEBKAC, but I've been playing around for a few days and I'm stuck.

HOAS. I might just be an idiot... Of course, I realized this AFTER I posted. I checked my config.cfg
and noticed the label reader was pointing to a non-existent file. Very odd that it would find any labels at all. Let's see if the corrected label path fixes things. Standby.

[initialize.components.textcat.labels]
@readers = "spacy.read_labels.v1"
path = "spacy/labels/spancat.json"
require = true
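For comparison, the corrected block should presumably point the spancat component (not textcat) at that file, as in the Prodigy-generated config shown further down this thread:

[initialize.components.spancat]

[initialize.components.spancat.labels]
@readers = "spacy.read_labels.v1"
path = "spacy/labels/spancat.json"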


Thanks for the update.

Just to make sure, have you seen these recent posts?

One other debugging tip - did you know you can view db-merge or data-to-spacy recipes (or any other built-in recipes)? This way you can see exactly what they're doing and debug/modify them.

Just run prodigy stats, then look in the folder shown as your Location:, and find the recipes folder. For db-merge, it's in the commands.py script, and data-to-spacy is in train.py.

Hope this helps!

Sadly, textcat labels and spancat labels aren't the same thing. I think it's far more likely that I need to edit the merge function of db-merge and/or data-to-spacy.

Ok, after a day of attempts, I can confirm that the "patch" to db-merge and data-to-spacy creates train.spacy and dev.spacy DocBins whose spans include all of my labels.

allspans_train = {}
for doc in docs_train:
    # count how many spans carry each label in the training DocBin
    for span in doc.spans["sc"]:
        allspans_train[span.label_] = allspans_train.get(span.label_, 0) + 1

allspans_train

{'ENCDATE': 695,
 'PATIENT': 3550,
 'DOB': 1791,
 'INSPOLN': 682,
 'PHONE': 251,
 'INSGRPN': 358,
 'MRN': 501,
 'SSN': 44,
 'EMAIL': 17,
 'CLAIMN': 78}

And the testing data:

allspans_dev = {}
for doc in docs_dev:
    # count how many spans carry each label in the dev DocBin
    for span in doc.spans["sc"]:
        allspans_dev[span.label_] = allspans_dev.get(span.label_, 0) + 1

allspans_dev

{'PATIENT': 809,
 'MRN': 93,
 'DOB': 421,
 'ENCDATE': 166,
 'PHONE': 50,
 'SSN': 5,
 'INSPOLN': 148,
 'INSGRPN': 72,
 'EMAIL': 4,
 'CLAIMN': 18}

So -- we are now getting all of the labels in the merged dataset.
BUT -- still no dice on training. (I tried training with Prodigy and with the exported spaCy files that I verified contain the spans.)

python -m spacy evaluate ./spacy_model/model-best ./spacy/dev.spacy --gpu-id 0 --spans-key PATIENT,DOB,MRN,SSN,INSPOLN,INSGRPN,EMAIL,PHONE,ENCDATE,CLAIMN -o ./evaluation.json

{
  "token_acc":1.0,
  "token_p":1.0,
  "token_r":1.0,
  "token_f":1.0,
  "spans_sc_p":0.7686628384,
  "spans_sc_r":0.5246360582,
  "spans_sc_f":0.6236272879,
  "spans_sc_per_type":{
    "DOB":{
      "p":0.8625954198,
      "r":0.8052256532,
      "f":0.8329238329
    },
    "MRN":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "PHONE":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "PATIENT":{
      "p":0.7239709443,
      "r":0.739184178,
      "f":0.7314984709
    },
    "ENCDATE":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "SSN":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "INSPOLN":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "INSGRPN":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "EMAIL":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "CLAIMN":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    }
  },
  "speed":37372.5665677684
}

Help :slight_smile:

Could you run spacy debug data?
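For example, assuming the config and DocBins from the data-to-spacy output directory above:

python -m spacy debug data spacy/config.cfg --paths.train spacy/train.spacy --paths.dev spacy/dev.spacy --verbose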

This post seems relevant:

If you don't see anything in debug data, any chance you have this in your config?

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0
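You can also print the resolved config to confirm what's actually being used (path assumed from the data-to-spacy output above):

python -m spacy debug config spacy/config.cfg --show-functions --show-variables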

There are a few more related posts on the spaCy GitHub discussions. Since your problem is more spaCy than Prodigy, you might want to check that repo for help.

Hope this helps!

24 hours of trying all sorts of things later, I'm still stuck. I've tried a few dozen ways of training (CPU, GPU, efficiency, accuracy, etc.) and I'm still losing entire labels. I'm parsing through the spaCy discussions, but so far haven't come across anything similar. I've also reproduced the same results on 3 different machines with 3 different GPUs just to see if it was something environmental. Any ideas?

hi @dcane,

Sorry you're having issues, especially after so much work. As I mentioned in my last post, can you check what your config file provides for training.score_weights, like:

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

Like the thread mentioned:

It's possible to have zero scores in training but actually train a model if your logging is configured incorrectly.

Since it seems like your issue is more spaCy, would it be possible to post on the spaCy GitHub discussions? Be sure to post the entire config file as that will help debug the problem. That forum includes spaCy core developers who can help you a lot faster, since it seems like your core issue is spaCy, not Prodigy.

Good idea. FWIW - I'm not upset at all. This is learning :slight_smile: I did confirm that the Prodigy auto-generated config.cfg (which is created when you run python -m prodigy data-to-spacy spacy --spancat FAX_MERGED_AUDIOLOGY_ALLERGY --eval-split 0.2 --verbose) looks correct. I've run a diff against that and a blank spancat config that I populated with python -m spacy init fill-config base_config.cfg config.cfg, and BOTH of them have:

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

The only differences between them are that the spaCy config generated by Prodigy has this at the bottom:


[initialize.components.spancat]

[initialize.components.spancat.labels]
@readers = "spacy.read_labels.v1"
path = "spacy/labels/spancat.json"


And fill-config from spaCy defaults the n-grams to:

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

Whereas Prodigy uses:

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 9

But that shouldn't make much of a difference. I'm running one complete end-to-end test to verify that all of the steps I've taken are correct, and then I'm going to post my findings on the spacy discussion forum.

Something strange is going on. When I train using the .cfg from Prodigy vs. the spaCy one, I'm getting different results. The Prodigy-based .cfg file is losing only 1 label, while the spaCy fill-config init is losing 2 labels. Same exact spaCy DocBins. Both are losing "MRN" (which is in both datasets), but the spaCy filled-in config is also losing ENCDATE (which is only in 1 dataset). I've triple-checked the DocBins, and the labels are 100% there:

train.spacy

{'ENCDATE': 692,
 'PATIENT': 2454,
 'DOB': 1588,
 'EMAIL': 19,
 'MRN': 491,
 'PHONE': 220,
 'INSPOLN': 94,
 'INSGRPN': 68,
 'SSN': 37,
 'CLAIMN': 3}

dev.spacy

{'PATIENT': 543,
 'ENCDATE': 162,
 'DOB': 360,
 'MRN': 95,
 'SSN': 7,
 'PHONE': 60,
 'INSPOLN': 28,
 'INSGRPN': 14,
 'EMAIL': 3}

The most common span tokens above in debug data look unexpected, especially having email addresses be frequent enough to show up in this list, so maybe there is still some issue with the underlying data/annotation/merging? (Or maybe it's just related to some anonymization?)

For email, you should consider using a pattern with LIKE_EMAIL, and for SSN or phone numbers patterns may also work better in addition to or instead of spancat.
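A minimal sketch of what a rule-based pass for those types could look like with spaCy's Matcher -- the SSN regex and the example text are just illustrations, not your data:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Emails: rely on spaCy's built-in LIKE_EMAIL token flag
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

# SSNs: illustrative regex, assuming ddd-dd-dddd stays a single token
matcher.add("SSN", [[{"TEXT": {"REGEX": r"^\d{3}-\d{2}-\d{4}$"}}]])

doc = nlp("Contact jane.doe@example.com, SSN 123-45-6789.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)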

The ngram lengths could make a difference if there are entity types like phone numbers that are typically always 4-grams or longer. You can try the exact same suggester settings from prodigy to see if that makes a difference? (Prodigy looks at your annotation in order to pick reasonable ngram lengths for the suggester, but spacy just uses 1-3 by default without analyzing your annotation.)

Lots and lots more testing...

I'm starting to wonder - is this a "feature" of NER - not a problem with prodigy/spacy? So, let's say I have 500 annotated faxes of "allergy" - out of 500, let's say most if not all have a PATIENT and DOB, some have an MRN, but very few have ENCDATE, SSN, PHONE, etc.

When I train allergy by itself, it can only recall the 3 labels it's seen the most: PATIENT, DOB, and MRN.

When I look at audiology, we almost always have an ENCDATE, PATIENT, DOB, and MRN, but little else and the training evaluation reflects that.

When I MERGE both sets together and then train, the scores for labels whose counts roughly overlap in frequency are similar, but I'm losing the labels that have low counts across the newly combined set.

/me waits for someone who actually knows what he/she is doing for a slap across the head

So, even though in aggregate I have more examples of each of the lesser annotated fields, the overall % of total examples across the dataset is low, and those labels are dropping. Expected?

Hi @dcane,

It might be the case that the model didn't learn the dropped categories well enough to generalize on the dev set.
Have you tried what happens if you test on the train dataset? If you can recall all labels while testing on the train set, at least you'd know there's nothing structurally wrong with the data. If you can confirm that, the next thing I'd do is try to upsample the underrepresented categories other than emails, SSNs and phone numbers. For emails, SSNs and phone numbers I'd follow @adriane's advice on using patterns rather than spancat. Let us know how it goes!
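To make the first check concrete, evaluating the trained pipeline against the training DocBin would look something like this (paths assumed from the commands earlier in the thread):

python -m spacy evaluate ./spacy_model/model-best ./spacy/train.spacy --gpu-id 0 -o ./evaluation_train.json

And a rough sketch of upsampling, i.e. duplicating the docs that contain rare labels before writing a new training DocBin -- the label set and factor below are placeholders to tune, not recommendations:

import random
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = list(DocBin().from_disk("spacy/train.spacy").get_docs(nlp.vocab))

RARE_LABELS = {"CLAIMN", "INSGRPN", "INSPOLN"}  # placeholder: pick the labels with low counts
UPSAMPLE_FACTOR = 3                             # placeholder: how many copies of each rare-label doc

upsampled = list(docs)
for doc in docs:
    if any(span.label_ in RARE_LABELS for span in doc.spans["sc"]):
        # add extra copies of docs that contain at least one rare label
        upsampled.extend([doc] * (UPSAMPLE_FACTOR - 1))

random.shuffle(upsampled)
DocBin(docs=upsampled, store_user_data=True).to_disk("spacy/train_upsampled.spacy")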