impact of percentage of evaluation data on performance


There is something I don't quite understand. My dataset is small: only 640 examples.
The task is simple: find a person's function in a conversation. The conversation is segmented into blocks of 100 words. However, the transcription is not great (missing words, misspellings, choppy conversation).

AGENT: You don't have a machine, right, because I have a little computer problem. I don't have the data. So you know how many folds you told me to do?
AGENT: Leave
AGENT: 254 folds and the weight less than 35 grams
CUSTOMER: I don't know
AGENT: how many sheets are on the inside
CUSTOMER: It's a little booke. I use a speaker. It's for ça
AGENT: If you can weigh it.
CUSTOMER: So it's 14 sheets plus the envelope. But it's glossy paper, so it's not necessarily
AGENT: It has to be more than 35 grams in any case.
CUSTOMER: I think so.
AGENT: And you're able to weigh it
CUSTOMER: I don't think so. We don't have a veil, we have to weigh the weight of ah then
CUSTOMER: Wait a minute, my colleague just asked if there was a scale that would make you absolutely happy.
AGENT: All right
CUSTOMER: We'll weigh it
AGENT: It's convenient
CUSTOMER: It's practical, 626 grams
AGENT: Oh, yes, that changes everything. 126 grams. Don't forget to stamp each envelope at the top left with the name and address of the company.
CUSTOMER: then a stamp on the top left with
AGENT: Here you go
CUSTOMER: Well, I'll do it.
AGENT: On the back of the envelope on the top left side
AGENT: Because it's mandatory
AGENT: I'm going to take your contact information because I don't actually have you as a contact.
AGENT: Ms. Michu
AGENT: your first name
CUSTOMER: renato
AGENT: very well your function in the company you do
CUSTOMER: sales assistant

Here the objective is to extract the information "sales assistant".

If I limit the evaluation to 10% so that there is a maximum of training data, I have an F1-score of 0.

If I increase the limit to 25%, I have a much better result with an F1-score of 73.29

What I don't understand is why the model gets better results with less training data. It should be the other way around, right?

Another question: can the fact that the conversation is chopped degrade the performance of spancat? And if so, what should be done to correct the problem, knowing that rebuilding the conversation is not easy?

Thxs for your help

hi @dad766!

I think your sample sizes are still small. 640 x 10% would mean only 64 examples in your evaluation dataset. Not sure if all of your records have spans but maybe you randomly got a weird patch.
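To make the small-sample point concrete, here is a minimal sketch in plain Python of how an exact-match F1 is computed from span counts. It is not Prodigy's or spaCy's scorer, and the counts are made up for illustration. With only about 64 evaluation examples, a model that gets no span exactly right scores 0, and a handful of hits moves the score a lot.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 from exact-match span counts; 0.0 when no span is matched."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# 10% of 640 examples held out for evaluation.
eval_size = 640 * 10 // 100

# Hypothetical counts, assuming one gold span per evaluation example.
no_hits = f1_score(true_pos=0, false_pos=5, false_neg=eval_size)
few_hits = f1_score(true_pos=8, false_pos=4, false_neg=eval_size - 8)

print(eval_size, no_hits, round(few_hits, 3))
```

Eight correct spans out of 64 are enough to move F1 from 0 to roughly 0.21 here, so two different random 10% splits can easily give very different scores.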

There are a few related spaCy GitHub discussions about other possible issues:

Just curious -- have you tried to run train-curve? I suspect the curve will be very noisy, as it typically needs a few hundred examples. This looks at the problem from a different angle: how the number of training examples affects model performance.

One important thing -- at some point, you should consider creating a dedicated evaluation dataset rather than creating a new one each time. Using --eval-split is great for simple experiments, but each run creates a new evaluation dataset, and that random partitioning can fool you.

One way to do this is to use the data-to-spacy recipe on your full dataset. This recipe will convert your entire dataset into two spaCy binary files: one for training, one for evaluation. It will also provide a starter config.cfg and labels file that can help speed up training if you use spacy train instead. Remember, prodigy train is just a wrapper for spacy train.
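Independently of any recipe, the idea of a dedicated evaluation set can be sketched in plain Python: shuffle once with a fixed seed, split, and reuse the same held-out examples for every experiment. The function name, the 20% fraction and the seed below are assumptions for illustration, not Prodigy defaults.

```python
import random


def fixed_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and return (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * eval_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for 640 annotated examples.
examples = [{"id": i} for i in range(640)]

train, dev = fixed_split(examples)
train_again, dev_again = fixed_split(examples)

# The same seed always yields the same held-out set.
print(len(train), len(dev), dev == dev_again)
```

Because the split no longer changes between runs, differences in score reflect changes to the model or the data rather than the luck of the partition.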

This way you can begin learning more about spaCy config. This can open up a lot of possible ways to configure your model as the GitHub posts mention. My favorite resource on spaCy config is from my colleague @ljvmiranda921's post:

It takes time to learn this (I'm still learning too), but there are huge advantages in the long term to getting comfortable with the spaCy config (and projects too).

Another benefit is that you can run debug data using that config file:

python -m spacy debug data config.cfg --verbose

What's great is it will provide some additional info about your spans too:

If your pipeline contains a spancat component, then debug data will also report span characteristics such as the average span length and the span (or span boundary) distinctiveness. The distinctiveness measure shows how different the tokens are with respect to the rest of the corpus using the KL-divergence of the token distributions. To learn more, you can check out Papay et al.’s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).

This can help you measure properties of your spans and how they can affect performance.

On whether the chopped conversation can degrade spancat performance: potentially, but I think there are still other things that could be done (e.g., reframing your task, more data, a larger dedicated evaluation set). Here's a recent post in spaCy's GitHub issues that discusses ways to optimize spancat performance:

Alternatively, you should also consider modifying your suggester functions. This is a bit easier to experiment with when you move to handling your own config file like I mentioned earlier.

It is important to know that long spans can affect speed/memory, especially when using the default n-grams:

Last, you may also find this FAQ post on tuning hyperparameters/config helpful:

What's great is that it goes through a spancat example, so it explains some of the parts of the config file relevant for spancat.

Thank you so much @ryanwesslen for all these leads.

To answer your question, all my samples have at least one label. Maybe it's necessary to have samples without any label at all?

I ran debug data: nothing special on the previous label. But on another label where I have a bad score, I get 3 warnings:

What does "138 misaligned tokens in the training data" mean?

I also ran train-curve. I get the following figure:

From what I understand, I still need to add data. I imagine close to the 2000 examples mentioned in the debug data output.
I will try to add some training data to get closer to 1000 samples, then see how it behaves and, if necessary, adjust the config, the suggester, etc.

But it raises other questions:

  1. What score should I aim for in the case of spancat? I have seen several messages where reaching an F1-score of 70 is already very good. Is this the score we should aim for? I guess it depends on the task, but can we have at least an idea?

  2. I have 30 different labels to extract. For the moment I'm training one model per label to determine how many samples are needed to get a good score for each label. The plan is then to merge all the samples into a single training set and train a single model with all 30 labels. I'm not convinced that creating a single model for my 30 labels is a good idea. What do you think? But specializing 30 models is also a lot. What are the best practices on the subject?

Check out this post.

Misaligned tokens are cases where spaCy's tokenization doesn't line up with the tokens in the training data. Before training, it runs the tokenizer on the provided raw text and tries to align those tokens with the provided token orth values. If it can't align the tokens and orths in any useful way, the annotation on those tokens becomes None during training.

I also like this post, which provides an example and shows how bad a problem it can be. Here's the example they go through:

text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]

The problem is that "Sus" is not aligned with the tokenization. I haven't dealt with this problem much. To my understanding, these spans will simply be dropped. It may be worth looking into which examples are mismatched like this. By default, Prodigy annotations automatically “snap” to token boundaries, so I'm not 100% sure why this happens. But since you're labeling spans, I suspect you may have labeled somewhere that splits a token (e.g., due to some small punctuation).

Perhaps look into those examples to see if you can figure out more details. Even better, you may be able to fix them so they're not dropped during training.
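As a rough way to find such examples, here is a minimal sketch that flags spans whose character offsets do not fall on token boundaries. It assumes a simple regex tokenizer as a stand-in for spaCy's tokenizer, so results can differ from what debug data reports; the helper names and the second span with its "GPE" label are hypothetical additions to the example above. In spaCy itself, Doc.char_span returns None for offsets that don't align with tokens, which is a similar check.

```python
import re


def token_boundaries(text):
    """Character offsets where tokens start and end (simple regex tokenizer)."""
    starts, ends = set(), set()
    for match in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def misaligned_spans(text, spans):
    """Return the spans whose start or end falls inside a token."""
    starts, ends = token_boundaries(text)
    return [
        (start, end, label)
        for start, end, label in spans
        if start not in starts or end not in ends
    ]


text = "Susan went to Switzerland."
spans = [(0, 3, "PERSON"), (14, 25, "GPE")]

bad = misaligned_spans(text, spans)
print(bad)
```

Only the (0, 3) span is flagged, because offset 3 falls inside the token "Susan".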

This curve looks good! This is what you want and would expect. Yes, it indicates you'll get improvements as you label more. train-curve can sometimes (especially with only a few hundred examples) show some volatility. You may also find the post below useful; it shares some ways you can customize train-curve (e.g., to add per-label performance too).

That's hard to say. Yes, it really depends on the case. A lot of the time, the score to aim for is based on your business problem -- e.g., you need a model that reaches x% to make the benefits worth the costs of false positives. But yes, I really like Peter's points in his GitHub post.

Honestly, if you aren't constrained by some business need (e.g., need accuracy at level X), I would use tools like train-curve to understand whether you are "maximizing" accuracy for your given annotation scheme (that is, how you define your spans/entities). The problem is that some spans are likely going to be easier for your model to learn than others. That's why you shouldn't expect all labels to converge to similar accuracies: each has a different "leveling off" accuracy. This is why I like train-curve: once you see the curve level off, it's more evidence that you've hit your maximum possible accuracy.

That's another great but tough question. I'd first highlight Prodigy's general philosophy when handling many labels:

If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.

That is, in general, try to break tasks up into the simplest possible pieces. So my gut would generally point more towards specializing 30 models over trying 1 model with 30 labels. This is not just for model training -- e.g., a model trying to learn 30 labels may have issues, especially things like catastrophic forgetting when you're retraining the model -- but also from a UX perspective: it's much easier for a human to make decisions on 1 label at a time than to choose from 30 labels.

I can understand that 30 models seems unreasonable. However, I have heard of interesting use cases, such as many-class classification, where users have created many binary classifiers. So perhaps there are opportunities to group similar labels together to make something in between, like 3-5 models, each with 6-10 labels?

Alternatively, is there any hierarchical structure in the labels? I really like Ines and Sofie's advice in this post (for NER, but same general idea holds):

Overall, my main suggestion would be to invest in good practices in tracking -- maybe Weights & Biases -- as well as evaluation best practices like the dedicated holdout set and training by config. Also, have you seen the spaCy project repos? There are sample projects like this one for spancat vs. ner. I mention it not to suggest ner but as an example of a good spaCy project setup.

Hope this helps!

Thank you for your precious help!

I finally found where the problem came from: some special characters trigger a warning. This is the case for the "-" character. In Prodigy there is no problem, I have a token "peut-être", but when I run the debug data command I get the warning.
Is it better to replace, for example, all the "-" with " "?

Example : "peut-être" => "peut être"

Simplifying the problem by breaking it down into smaller tasks has inspired me a lot, and yes, I think a breakdown is possible.
That would get me down to 15 independent labels, which I think should also be segmented.
Would another possible segmentation be, for example, by the average number of characters? Let me explain:

  • I have labels composed of only one word
  • I have labels composed of between 1 and 3 words
  • I have longer labels, intentions for example, which will always be between 5 and 20 words
Is it relevant to train models segmented by number of words, so in this case 3 models of about 5 labels each, or is it more relevant to create 15 different models?

Last thing: some labels are easily identifiable by a business rule. For phone numbers, for example, the word "téléphone" almost always appears just before. So in the end, once you have found a sequence of digits, it may be easier to search for the word "téléphone" with a regular expression than to train a model to do so. This is another way to simplify the problem.
What do you think about it?

In an attempt to be specific, would I be correct to assume that when you say "label" that you're referring to an entity that you're interested in? I'm assuming so in my answer here.

It's perfectly fine to combine a machine learning approach with a rule-based one. In your case, I can certainly see how a telephone number might be retrieved with a rule as well, but I would think about other rules that apply. For example:

My number is +316 1234 5678

In this case I might still apply a rule, but the "téléphone" keyword rule wouldn't apply. So you might need to come up with a set of rules if you want to catch many of these edge cases.

And this is also where ML might become interesting. An ML approach wouldn't require you to construct rules for every edge case; you would instead need to provide enough data for each edge case that the statistical approach can figure out the appropriate rules.
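A minimal sketch of what such a set of rules could look like, using Python's re module. Both patterns are assumptions for illustration (they are not tuned for real phone formats), and the French test sentence is made up: one rule looks for digits shortly after the keyword "téléphone", the other catches numbers that start with "+" regardless of context.

```python
import re

# Rule 1 (keyword rule from the thread): digits shortly after "téléphone".
KEYWORD_RULE = re.compile(r"téléphone\D{0,20}?(\+?\d[\d ]{6,}\d)", re.IGNORECASE)
# Rule 2 (assumed edge-case rule): an international-looking number with "+".
PLUS_RULE = re.compile(r"(\+\d[\d ]{6,}\d)")


def find_phone_numbers(text):
    """Apply every rule and return the distinct matched numbers in text order."""
    found = []
    for rule in (KEYWORD_RULE, PLUS_RULE):
        for match in rule.finditer(text):
            span = match.span(1)
            if span not in found:
                found.append(span)
    return [text[start:end] for start, end in sorted(found)]


print(find_phone_numbers("Mon téléphone est le 06 12 34 56 78"))
print(find_phone_numbers("My number is +316 1234 5678"))
```

Each new edge case tends to need another pattern, which is the maintenance cost to weigh against labeling more data.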

Having said all of this, I do need to add a caveat that "it depends". The only way to know for sure which approach works best for your situation is to try it out. My advice would be to always keep iterating until you reach a system that is satisfactory. It's certainly possible that in your final system some of your entities will be detected by rules, others might be detected by a NER model, others by a spancat model and maybe a few will be detected by a postprocessing step that uses all the previous steps as input.

Yes, sorry, when I say "label" I mean "entity".

That's what I think too. Initially I wanted to do everything with a single spancat, but considering the complexity it's not necessarily the right approach and it would require a very large dataset.

Speaking of data: I have roughly 600 examples for a label with up to 400 possible values. If I artificially increase the amount of data by reusing examples and just modifying the values, would that allow me to improve the score?
AGENT: what do you want to buy as a product.
CUSTOMER: I would like to order a lamp

The idea is to duplicate this example, replacing "a lamp" with "a chair", then "a table", etc.
Would I get better results by artificially increasing my data?

There's a risk in this tactic. It's a good thing to have "more data", but when you simulate data you might push the algorithm to overfit on your simulator at the expense of what users will actually say. If you have a simulator that resembles reality very well, this might not be a huge issue. But there's a risk that your algorithm will mainly rely on the "I would like to buy a {product}" template to detect entities.
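To illustrate both the tactic and the risk, here is a minimal sketch of template filling in plain Python. The function name, the "{product}" slot syntax and the "PRODUCT" label are assumptions for illustration. It produces texts with correct character offsets, but every generated example shares exactly the same context around the span.

```python
def fill_template(template, slot, values, label):
    """Return (text, [(start, end, label)]) pairs, one per slot value."""
    start = template.index(slot)
    filled = []
    for value in values:
        text = template.replace(slot, value, 1)
        filled.append((text, [(start, start + len(value), label)]))
    return filled


template = "CUSTOMER: I would like to order {product}"
values = ["a lamp", "a chair", "a table"]
augmented = fill_template(template, "{product}", values, "PRODUCT")

# The text before the span is identical in every simulated example.
contexts = {text[:spans[0][0]] for text, spans in augmented}
print(len(augmented), len(contexts))
```

Three examples but only one distinct context: the model sees more product names, yet learns nothing new about how customers actually phrase an order.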

If you're genuinely worried about not having enough training data then I can see how you might consider this tactic, but we should remember that the best data to train on is data that accurately reflects actual users.

Related: if you're worried about detecting specific values ... wouldn't a pattern matcher help? Feel free to correct me if I'm wrong.

Indeed, you are right. I had not seen this possibility in spaCy.

In fact, I really have the impression that it has to be decided case by case, depending on what we are looking for.
Typically, the PhraseMatcher could work very well in the case of products, since we are specifically looking for the names of the products mentioned in the conversation.
The problem with the PhraseMatcher is that it's an exact search, so in the case of a transcription error it won't return anything.
I will test both approaches and see which one works best.
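One way to soften the exact-match limitation, sketched here with Python's standard difflib rather than spaCy, is approximate matching of each word against the product list. The product list, the 0.8 cutoff and the misspelled word are assumptions for illustration, and this only handles single-word product names.

```python
import difflib

PRODUCTS = ["lamp", "chair", "table"]  # hypothetical product list


def fuzzy_products(text, products=PRODUCTS, cutoff=0.8):
    """Return (word, matched product) pairs for words close to a product name."""
    hits = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        match = difflib.get_close_matches(word, products, n=1, cutoff=cutoff)
        if match:
            hits.append((word, match[0]))
    return hits


# "tabl" simulates a transcription error for "table".
print(fuzzy_products("I would like a tabl and a chair"))
```

A lower cutoff catches more transcription errors but also more false positives, so it would need checking against real transcripts.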

Thanks a lot for your help!

Happy to hear it.

Keep on iterating and reflecting. That's how progress is made!