impact of percentage of evaluation data on performance

hi @dad766!

I think your sample sizes are still small: 640 × 10% means only 64 examples in your evaluation dataset. I'm not sure whether all of your records have spans, but you may have randomly drawn an unrepresentative split.

There are a few related spaCy GitHub discussions on other possible issues:

Just curious -- have you tried running train-curve? I suspect the curve will be very noisy, as it typically needs a few hundred examples. This approaches the problem from a different angle: how the number of training examples affects model performance.
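In case you want to try it, here's roughly what that invocation looks like (a sketch; `my_spans` is a placeholder dataset name, so swap in your own):

```
# train at 25%, 50%, 75% and 100% of the data and print the curve
python -m prodigy train-curve --spancat my_spans --show-plot
```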

One important thing -- at some point, you should consider creating a dedicated, fixed evaluation dataset rather than creating a new one each time. Using --eval-split is great for simple experiments, but it draws a new random partition on every run, and that randomness can fool you when you compare results.
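If you want to create that fixed split yourself, here's a minimal sketch using Prodigy's database API (the dataset names `my_spans`, `my_spans_train` and `my_spans_eval` are placeholders; adjust to your project):

```python
import random

from prodigy.components.db import connect

db = connect()  # uses the database configured for your Prodigy install
examples = db.get_dataset("my_spans")  # placeholder dataset name

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(examples)

split = int(len(examples) * 0.8)
subsets = {"my_spans_train": examples[:split], "my_spans_eval": examples[split:]}

# save each half as its own dataset so every run evaluates on the same examples
for name, subset in subsets.items():
    if name not in db:
        db.add_dataset(name)
    db.add_examples(subset, datasets=[name])
```

You can then point prodigy train at the fixed split with something like `--spancat my_spans_train,eval:my_spans_eval`.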

Another way to do this is to use the data-to-spacy recipe on your full dataset. This recipe converts your entire dataset into two spaCy binary files: one for training, one for evaluation. It also provides a starter config.cfg and a labels file that can help speed up training if you use spacy train instead. Remember, prodigy train is just a wrapper around spacy train.
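For example (a sketch; `my_spans` is a placeholder dataset name, and `--eval-split 0.2` reserves 20% for evaluation -- the split is random, but it's written to disk once, so you can reuse the same dev.spacy across runs):

```
python -m prodigy data-to-spacy ./corpus --spancat my_spans --eval-split 0.2

# then train directly with spaCy using the generated config and binary data
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```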

This way you can begin learning more about the spaCy config system. It can open up a lot of possible ways to configure your model, as the GitHub posts mention. My favorite resource on the spaCy config is my colleague @ljvmiranda921's post:

It takes time to learn this (I'm still learning too), but there are huge advantages in the long term to getting comfortable with the spaCy config (and projects too).

Another benefit is that you can run debug data using that config file:

```
python -m spacy debug data ./corpus/config.cfg --verbose
```

What's great is that it will also provide some additional info about your spans:

> If your pipeline contains a spancat component, then debug data will also report span characteristics such as the average span length and the span (or span boundary) distinctiveness. The distinctiveness measure shows how different the tokens are with respect to the rest of the corpus using the KL-divergence of the token distributions. To learn more, you can check out Papay et al.'s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).

This can help you measure qualities of your spans and see how they might affect performance.

Potentially, but I think there are still other things that could be done first (e.g., reframing your task, or more data and a larger dedicated evaluation set). Here's a recent post in spaCy's GitHub issues that discusses ways to optimize spancat performance:

Alternatively, you could also consider modifying your suggester functions. This is a bit easier to experiment with once you're managing your own config file, as I mentioned earlier.
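For reference, here's the relevant block from a typical spancat config (a sketch of the default setup, not your exact config):

```
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]
```

`sizes` controls which n-gram lengths become candidate spans, so it's one of the first things worth tuning.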

It's important to know that long spans can affect speed and memory, especially when using the default n-gram suggester:
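If that becomes a problem, one option is the range suggester, which caps the candidate span length (the `max_size` value below is just illustrative):

```
[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 5
```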

Lastly, you may also find this FAQ post on tuning hyperparameters/config helpful:

What's great is that it walks through a spancat example, so it explains the parts of the config file that are relevant for spancat.