Low score in spancat training

Good day,

We only have two labels in our dataset for training spancat, but we are getting a very low score (0.28) with a training set of 1,000 examples. For one of our labels, we have tagged long phrases (between 2 and 10 words). Would that affect the result/score? Is it better to have short phrases for spancat? Or is it because we have too few samples/training examples?

Thanks.

What's the task that you're training spancat for? It helps to understand the linguistic complexity of the task before worrying about hyperparameters. Also, is the task in English? Anything you can share about the dataset?

We are classifying transcribed voicemails, specifically ones regarding medications. So, for example, we are tagging the medication name, and whether the caller is asking for a refill, having an issue/reaction, has a usage question, etc.

So we used spans to tag "Rx type" (drug/medication names) and "Rx details" (I need a refill, I have a question, I am having a symptom, how much do I take, etc.).

As a follow-up, would it matter if we just tagged the most important keywords in a phrase/span and tried to skip as many stop words as possible? Right now, our span sizes range from 1 to 31 tokens (as reported by spacy debug data).

Is span categorization the best way to handle this? Are there other approaches we can use to find the Rx Type and Rx Details?

It might help to see some examples of the super long spans. I'm a bit surprised that a user might need 31 tokens to say that they need a refill or have a question.

A few thoughts come to mind that might inspire progress, though.

  1. You can choose to use spancat for one part of the problem and NER for the other. The drug/medication name sounds like something that NER might handle well, because it sounds like a span with a clear start/end. Spancat is designed for longer spans in general, where it's perhaps a bit less clear where they start/end. If you're interested, this blog post tries to explain some of the most important differences/use cases for both. We also have a convenient overview in the Prodigy docs here. There's a short sketch of what combining the two could look like after this list.
  2. Do you have a distribution over these lengths? If a span of 31 tokens appears one in a million times, it might be fine to worry less about spans of that length.
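
For the first point, here's a minimal sketch of what a pipeline with both components could look like. The component setup and label names (RX_TYPE, RX_DETAILS) are just placeholders for illustration; in practice you'd configure this in your training config, but the idea is the same:

import spacy

# Hypothetical sketch: NER for clearly delimited drug names,
# spancat for the longer, fuzzier "Rx details" phrases.
nlp = spacy.blank("en")

ner = nlp.add_pipe("ner")
ner.add_label("RX_TYPE")          # e.g. drug/medication names

spancat = nlp.add_pipe("spancat")
spancat.add_label("RX_DETAILS")   # e.g. refill requests, usage questions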

Here is the partial result of using spacy debug data:

  • 1 (682 spans)
  • 2 (252 spans)
  • 3 (193 spans)
  • 4 (209 spans)
  • 5 (240 spans)
  • 6 (240 spans)
  • 7 (219 spans)
  • 8 (201 spans)
  • 9 (165 spans)
  • 10 (99 spans)
  • 11 (88 spans)
  • 12 (66 spans)
  • 13 (27 spans)
  • 14 (18 spans)
  • 15 (19 spans)
  • 16 (9 spans)
  • 17 (9 spans)
  • 18 (4 spans)
  • 19 (2 spans)
  • 20 (1 spans)
  • 21 (1 spans)
  • 22 (1 spans)
  • 31 (1 spans)

Also, does it matter that we use spacy.blank("en") with add_tokens in our custom recipe?

And here is the result of our training run with en_core_web_lg as the base model:

When I feed your numbers into a small spreadsheet, I get the following:

 span_size     n  cumulative
         1   682       0.248
         2   252       0.340
         3   193       0.410
         4   209       0.487
         5   240       0.574
         6   240       0.661
         7   219       0.741
         8   201       0.814
         9   165       0.874
        10    99       0.910

That means that if you just limit yourself to spans of up to size 10, you'll still be covering ~91% of the cases. That might still be fine as a starting point and seems like a much easier subset to focus on initially.
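
In case a spreadsheet isn't handy, the same calculation is a few lines of Python, using the counts from your debug data output above:

# Cumulative coverage of spans up to 10 tokens, from the
# span-length counts reported by `spacy debug data`.
counts = {
    1: 682, 2: 252, 3: 193, 4: 209, 5: 240, 6: 240, 7: 219, 8: 201,
    9: 165, 10: 99, 11: 88, 12: 66, 13: 27, 14: 18, 15: 19, 16: 9,
    17: 9, 18: 4, 19: 2, 20: 1, 21: 1, 22: 1, 31: 1,
}
total = sum(counts.values())
covered = sum(n for length, n in counts.items() if length <= 10)
print(f"{covered}/{total} spans = {covered / total:.1%} covered")  # ~91%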

That said, is there maybe an example of a very long span that you can share? I'm still a bit surprised and I'd like to understand the nature of your problem better. Out of curiosity, when you look at the mistakes that your model makes, are there any patterns?

Related: how was this data annotated? By a team? Are you 100% sure there are no label errors?

Assuming that you're using English, that should be totally fine. All of the pretrained pipelines use the same tokeniser as blank:en.
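
If you want to convince yourself on your own texts, a tiny sanity check might look like this (assuming en_core_web_lg is installed locally):

import spacy

# Both pipelines use the same English tokenizer rules, so the
# tokens produced for your recipe should line up exactly.
blank = spacy.blank("en")
lg = spacy.load("en_core_web_lg")

text = "I need to get more refills for my prescription."
assert [t.text for t in blank(text)] == [t.text for t in lg(text)]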

Do we need to relabel everything so the size of the spans is 10 max? How do we see the errors that the model makes during training? Only one person actually did the labeling, so I really am wondering why we are getting a very low score. :frowning:

Here are some examples of the spans (Rx Details) we have tagged:

  • need to get more refills for my prescription
  • start my medication on Friday
  • I just had a few questions
  • CVS specialty pharmacy
  • need prior authorization
  • which I should do 1st
  • need to order it tomorrow
  • questions about my medication
  • request a script
  • waiting for a prescription
  • does n't have the prior authorization
  • help with my prescriptions
  • prescriptions will run out before my appointment
  • have one refill left
  • have a question

Do we need to relabel everything so the size of the spans is 10 max?

That shouldn't be needed. What I might do is write a small filtering script that creates the subset that you're interested in. You can do this from Python, by the way, using the database API. It would look something like this:

from prodigy.components.db import connect

# Connect to the Prodigy database and fetch all annotated examples
db = connect()
examples = db.get_dataset_examples("my_dataset_name")

From here you can build whatever subset you like from the examples list of annotations.
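
For example, a rough sketch of such a filter, dropping spans longer than 10 tokens and saving the result under a new dataset name, could look like this. The dataset names are placeholders, and it assumes your spans carry the standard token_start/token_end offsets that Prodigy adds:

from prodigy.components.db import connect

MAX_TOKENS = 10  # keep spans of at most 10 tokens

db = connect()
examples = db.get_dataset_examples("my_dataset_name")

filtered = []
for eg in examples:
    # token_end is inclusive in Prodigy's span format, so the
    # span length in tokens is token_end - token_start + 1.
    eg["spans"] = [
        s for s in eg.get("spans", [])
        if (s["token_end"] - s["token_start"] + 1) <= MAX_TOKENS
    ]
    filtered.append(eg)

# Store the subset under a new name so the original dataset stays intact.
db.add_dataset("my_dataset_max10")
db.add_examples(filtered, datasets=["my_dataset_max10"])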

How do we see the errors that the model makes during training?

Ah, I didn't mean to check during training, but rather afterward.

You can either take a quantitative approach by checking on a validation set which errors are made most frequently. In your case, you could check whether the errors are related to span length. Another option is to take a more qualitative approach and just interact with the model manually to see what kinds of texts it gets wrong.
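
For the quantitative route, something along these lines could work once you have a trained pipeline and a held-out .spacy file. The paths and the "sc" spans key are assumptions based on the default spancat setup:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("training/model-best")           # path to your trained pipeline
doc_bin = DocBin().from_disk("corpus/dev.spacy")  # held-out evaluation data

for gold in doc_bin.get_docs(nlp.vocab):
    pred = nlp(gold.text)
    gold_spans = {(s.start, s.end, s.label_) for s in gold.spans.get("sc", [])}
    pred_spans = {(s.start, s.end, s.label_) for s in pred.spans.get("sc", [])}
    # Spans the model missed, with their length in tokens
    for start, end, label in gold_spans - pred_spans:
        print(f"MISSED [{label}] len={end - start}: {gold[start:end].text}")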

In my experience, it really helps to understand when a model fails if you're looking for inspiration on how to improve it.

Only one person actually did the labeling, so I really am wondering why we are getting a very low score.

It's impossible to say, but it might be good to check whether the annotations are consistent. If possible, I might try to get a second person to annotate some of these examples as well, just to see if both annotators agree. You can use the review recipe to help with this. If it turns out that there's disagreement between annotators, it might help to take a step back and wonder whether the definitions of what needs to be annotated are clear.
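
If you do get a second set of annotations, a rough way to quantify exact-match agreement between two Prodigy datasets (the dataset names below are placeholders) would be something like:

from prodigy.components.db import connect

db = connect()

def span_set(examples):
    # Key each span by the input hash plus its character offsets and label,
    # so the same text annotated by two people can be compared span by span.
    return {
        (eg["_input_hash"], s["start"], s["end"], s["label"])
        for eg in examples
        for s in eg.get("spans", [])
    }

a = span_set(db.get_dataset_examples("rx_annotator_a"))
b = span_set(db.get_dataset_examples("rx_annotator_b"))

agreement = len(a & b) / max(len(a | b), 1)
print(f"A: {len(a)} spans, B: {len(b)} spans, exact-match agreement: {agreement:.2f}")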

Some of the examples that you've given seem like good candidates to kickstart a discussion. Let's consider these few:

  • questions about my medication
  • have a question
  • I just had a few questions

When I look at these examples I immediately wonder ... when do you include the verb "have/had" in the span and when don't you? When do you include "I" in the span? Is there a clear definition for this? If not, it might explain why the model has trouble predicting some of these candidates.

This is all a bit speculative though, merely based on a quick impression. It's certainly possible that there's another reason for your model's performance, but it seems like a valid thing to consider.

I hope this helps. Let me know if you have follow-up questions!