Low score in spancat training

Good day,

We only have two labels in our dataset for training spancat, but we are getting a very low score (0.28) with a training set of 1,000 examples. For one of our labels, we have tagged long phrases (between 2 and 10 words). Would that affect the result/score? Is it better to have short phrases for spancat? Or is it because we have too little training data?


What's the task that you're training spancat for? It helps to understand the linguistic complexity of the task before worrying about hyperparameters. Also, is the task in English? Anything you can share about the dataset?

We are classifying transcribed voicemails. Specifically, regarding medications. So, for example we are tagging the medication name, if they are asking for a refill, having an issue/reaction, have a usage question, etc.

So, we used the span to tag "Rx type" (Drug/medication names) and "Rx details" (I need a refill, I have a question, I am having a symptom, how much do I take, etc.)

As a follow-up, would it matter if we just tag the most important keywords in a phrase/span and try to skip as many stop words as possible? Right now, our spans are between 1 and 31 tokens long (as reported by spacy debug data).

Is span categorization the best way to handle this? Are there other approaches we can use to find the Rx Type and Rx Details?

It might help to see some examples of the super long spans. I'm a bit surprised that a user might need 31 tokens to say that they need a refill or have a question.

There are a few thoughts in my mind that might inspire progress though.

  1. You can choose to use spancat for one part of the problem and NER for other parts. The drug/medication name sounds like something that NER might be able to handle well, because it sounds like a span with a clear start and end. Spancat is designed for longer spans in general, where it's perhaps a bit less clear where the span starts and ends. If you're interested, this blog post tries to explain some of the most important differences and use cases for both. We also have a convenient overview on the Prodigy docs here.
  2. Do you have a distribution over these lengths? If a span of 31 tokens appears once in a million times then it might be fine to worry less about spans of that length.

Here is the partial result of using spacy debug data:

  • 1 (682 spans)
  • 2 (252 spans)
  • 3 (193 spans)
  • 4 (209 spans)
  • 5 (240 spans)
  • 6 (240 spans)
  • 7 (219 spans)
  • 8 (201 spans)
  • 9 (165 spans)
  • 10 (99 spans)
  • 11 (88 spans)
  • 12 (66 spans)
  • 13 (27 spans)
  • 14 (18 spans)
  • 15 (19 spans)
  • 16 (9 spans)
  • 17 (9 spans)
  • 18 (4 spans)
  • 19 (2 spans)
  • 20 (1 spans)
  • 21 (1 spans)
  • 22 (1 spans)
  • 31 (1 spans)

Also does it matter if you use spacy.blank("en") when using add_tokens for our custom recipe?

And here is the result of us training with en_core_web_lg as our base model:

When I feed your numbers into a small spreadsheet then I get the following:

span_size     n  cumulative_share
        1   682             0.248
        2   252             0.340
        3   193             0.410
        4   209             0.487
        5   240             0.574
        6   240             0.661
        7   219             0.741
        8   201             0.814
        9   165             0.874
       10    99             0.910

That means that if you just limit yourself to spans of up to size 10, you'll still be covering ~91% of the cases. That might still be fine as a starting point and seems like a much easier subset to focus on initially.
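In case you'd like to reproduce those cumulative numbers yourself, here is a minimal sketch in plain Python. The `span_counts` dictionary is copied by hand from the debug data output you shared, and `cumulative_coverage` is just a helper name I made up for this illustration.

```python
from itertools import accumulate

# Span length -> number of spans, copied from the `spacy debug data` output above.
span_counts = {
    1: 682, 2: 252, 3: 193, 4: 209, 5: 240, 6: 240, 7: 219, 8: 201,
    9: 165, 10: 99, 11: 88, 12: 66, 13: 27, 14: 18, 15: 19, 16: 9,
    17: 9, 18: 4, 19: 2, 20: 1, 21: 1, 22: 1, 31: 1,
}


def cumulative_coverage(counts):
    """Return {span_length: share of all spans with at most that many tokens}."""
    total = sum(counts.values())
    lengths = sorted(counts)
    running = accumulate(counts[length] for length in lengths)
    return {length: seen / total for length, seen in zip(lengths, running)}


coverage = cumulative_coverage(span_counts)
for length in range(1, 11):
    print(f"{length:>2}  {span_counts[length]:>4}  {coverage[length]:.3f}")
```

The share at length 10 is the fraction of annotated spans you would keep if you capped the span length there.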

That said, is there maybe an example of a very long span that you can share? I'm still a bit surprised and I'd like to understand the nature of your problem better. Out of curiosity, when you look at the mistakes that your model makes, are there any patterns?

Related: how was this data annotated? By a team? Are you 100% sure there are no label errors?

As for using spacy.blank("en") with add_tokens: assuming that you're using English, that should be totally fine. All of the pretrained English pipelines use the same tokeniser as blank:en.

Do we need to relabel everything so the size of the spans is 10 max? How do we see the errors that the model makes during training? Only one person actually did the labeling, so I really am wondering why we are getting a very low score. :frowning:

Here are some examples of the spans (Rx Details) we have tagged:

  • need to get more refills for my prescription
  • start my medication on Friday
  • I just had a few questions
  • CVS specialty pharmacy
  • need prior authorization
  • which I should do 1st
  • need to order it tomorrow
  • questions about my medication
  • request a script
  • waiting for a prescription
  • does n't have the prior authorization
  • help with my prescriptions
  • prescriptions will run out before my appointment
  • have one refill left
  • have a question

Do we need to relabel everything so the size of the spans is 10 max?

That shouldn't be needed. What I might do is write a small filtering script that creates the subset that you're interested in. You can do this from Python, by the way, using the database API. It would look something like:

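from prodigy.components.db import connect

# Connect to the database configured for your Prodigy installation
db = connect()
# Load every annotated example in the dataset as a list of dictionaries
examples = db.get_dataset_examples("my_dataset_name")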

From here you can build whatever subset you like from the examples list.
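As a rough sketch of what such a filter could look like: the function below drops every span longer than 10 tokens from each example. It assumes that each span dictionary carries `token_start` and `token_end` keys with an inclusive end index, which is how Prodigy's span interfaces normally store them, so do check this against your own data. The name `filter_long_spans`, the `max_tokens` parameter, the label names and the sample example are all made up for illustration.

```python
import copy


def filter_long_spans(examples, max_tokens=10):
    """Return copies of the examples with spans longer than max_tokens removed."""
    filtered = []
    for eg in examples:
        eg = copy.deepcopy(eg)  # leave the original annotations untouched
        eg["spans"] = [
            span
            for span in eg.get("spans", [])
            # token_end is assumed to be inclusive, hence the +1
            if span["token_end"] - span["token_start"] + 1 <= max_tokens
        ]
        filtered.append(eg)
    return filtered


# Tiny hand-made example in the assumed format (not real data).
examples = [
    {
        "text": "placeholder voicemail text",
        "spans": [
            {"token_start": 0, "token_end": 0, "label": "RX_TYPE"},
            {"token_start": 2, "token_end": 19, "label": "RX_DETAILS"},
        ],
    },
]
subset = filter_long_spans(examples, max_tokens=10)
print(len(subset[0]["spans"]))
```

One design choice to be aware of: this keeps the text and only removes the long span, so the model would see that stretch of text as unannotated. If that feels wrong for your data, you could skip the whole example instead.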

How do we see the errors that the model makes during training?

Ah, I didn't mean to check during training, but rather afterward.

One option is the quantitative approach: check on a validation set which errors are made most frequently. In your case, you could check whether the errors are related to the span length. Another option is a more qualitative approach: just interact with the model manually to see what kinds of texts it gets wrong.
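To make the quantitative route a bit more concrete, here is a minimal sketch that counts missed and spurious spans per span length. It works on plain `(start_token, end_token, label)` tuples with an exclusive end, so it does not depend on any library. With spaCy you could build those tuples from the spans in `doc.spans["sc"]` (assuming the default spans key) for both the gold and the predicted docs. The function name, the labels and the toy data are invented for illustration.

```python
from collections import Counter


def errors_by_length(gold, predicted):
    """Count missed and spurious spans per span length (in tokens).

    gold / predicted: one set per document of (start_token, end_token, label)
    tuples, where end_token is exclusive.
    """
    missed, spurious = Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, predicted):
        # In gold but not predicted: the model missed these.
        for start, end, _label in gold_spans - pred_spans:
            missed[end - start] += 1
        # Predicted but not in gold: the model made these up or got the boundary wrong.
        for start, end, _label in pred_spans - gold_spans:
            spurious[end - start] += 1
    return missed, spurious


# Toy data: the second span is predicted with a boundary that is off by one token.
gold = [{(0, 1, "RX_TYPE"), (3, 9, "RX_DETAILS")}]
predicted = [{(0, 1, "RX_TYPE"), (4, 9, "RX_DETAILS")}]
missed, spurious = errors_by_length(gold, predicted)
print("missed:", dict(missed))
print("spurious:", dict(spurious))
```

If most of the missed spans turn out to be long ones, or if missed and spurious spans tend to come in pairs that differ by a token or two, that points towards a boundary problem rather than a lack of data.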

In my experience, it really helps to understand when a model fails if you're looking for inspiration on how to improve it.

Only one person actually did the labeling, so I really am wondering why we are getting a very low score.

It's impossible to say for sure, but it might be good to check if the annotations are consistent. If possible, I might try to get a second person to annotate some of these examples as well, just to see if both annotators agree. You can use the review recipe to help with this. If it turns out that there's disagreement between annotators, then it might help to take a step back and ask whether the definitions of what needs to be annotated are clear.
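If you do get a second annotator, one simple way to put a number on the agreement is an exact-match F1 score between the two sets of spans. The sketch below is only one possible measure, not something built into Prodigy, and the function name, labels and toy data are made up. It expects one set of `(start_token, end_token, label)` tuples per document for each annotator.

```python
def span_agreement(annotator_a, annotator_b):
    """Exact-match F1 between two annotators' spans over the same documents."""
    overlap = total_a = total_b = 0
    for spans_a, spans_b in zip(annotator_a, annotator_b):
        overlap += len(spans_a & spans_b)  # same start, end and label
        total_a += len(spans_a)
        total_b += len(spans_b)
    if total_a + total_b == 0:
        return 1.0  # neither annotator marked anything, so they agree
    return 2 * overlap / (total_a + total_b)


# Toy data: both agree on the drug name but disagree on where the details span starts.
annotator_a = [{(0, 1, "RX_TYPE"), (3, 6, "RX_DETAILS")}]
annotator_b = [{(0, 1, "RX_TYPE"), (2, 6, "RX_DETAILS")}]
agreement = span_agreement(annotator_a, annotator_b)
print(agreement)
```

Exact matching is strict on purpose here: a span that is off by a single token counts as a disagreement, which is also how the model is scored.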

Some of the examples that you've given seem like good candidates to kickstart a discussion. Let's consider these few:

  • questions about my medication
  • have a question
  • I just had a few questions

When I look at these three examples I immediately wonder ... when do you include the verb "have/had" in the span and when don't you? When do you include "I" in the span? Is there a clear definition for this? If not, it might explain why the model has trouble predicting some of these candidates.
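A quick way to look for this kind of boundary inconsistency is to tally the first word of every annotated span. The sketch below does that for a handful of the span texts you shared. It splits on whitespace, which is a simplification compared to spaCy's tokeniser, and the variable names are just for illustration.

```python
from collections import Counter

# A few of the "Rx Details" span texts shared earlier in this thread.
span_texts = [
    "questions about my medication",
    "have a question",
    "I just had a few questions",
    "need to get more refills for my prescription",
    "have one refill left",
    "need prior authorization",
]

# Whitespace split is a rough stand-in for real tokenisation.
first_tokens = Counter(text.split()[0].lower() for text in span_texts)
for token, count in first_tokens.most_common():
    print(f"{token:<10} {count}")
```

If spans with similar meaning sometimes start at the pronoun, sometimes at the verb and sometimes at the noun, that is a sign that the annotation guidelines could use a rule about where a span should begin.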

This is all a bit speculative though, merely based on a quick impression. It's certainly possible that there's another reason for your model's performance, but it seems like a valid thing to consider.

I hope this helps. Let me know if you have follow-up questions!