impact of percentage of evaluation data on performance

Check out this post.

Misaligned tokens are cases where spaCy's tokenization doesn't line up with the tokens in the training data. Before training, spaCy runs its tokenizer on the provided raw text and tries to align those tokens with the provided token orth values. If it can't align the tokens and orths in any useful way, the annotation on those tokens becomes None during training.

I also like this post, which provides an example and shows how bad of a problem it can be. Here's the example they go through:

text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]

The problem is that "Sus" doesn't line up with the tokenization. I haven't dealt with this problem as much; to my understanding, these spans will simply be dropped. What may be worth looking into is which of your examples are mismatched like this. By default, Prodigy will automatically “snap” annotations to token boundaries, so I'm not 100% sure why this is happening. But since you're labeling spans, I suspect some labels may have been placed so that they split a token (e.g., due to some small punctuation).
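
If you want to see exactly what spaCy does with that span, here's a minimal sketch (assuming a blank English pipeline; swap in your own) that converts the character offsets to BILUO tags. The misaligned token comes back as "-", which is the missing-annotation marker that gets treated as None during training:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # assumption: blank English tokenizer; swap in your own pipeline
text = "Susan went to Switzerland."
entities = [(0, 3, "PERSON")]  # "Sus" doesn't end on a token boundary

doc = nlp.make_doc(text)
tags = offsets_to_biluo_tags(doc, entities)
print(list(zip([t.text for t in doc], tags)))
# roughly: [('Susan', '-'), ('went', 'O'), ('to', 'O'), ('Switzerland', 'O'), ('.', 'O')]
# The '-' marks a missing annotation: that token's label is treated as unknown
# during training, so the span is effectively dropped rather than learned.
```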

Perhaps look into those examples to see if you can figure out more details. Even better, you may be able to fix them so they're not dropped during training.
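
As a starting point, here's a rough sketch for finding those spans in a JSONL export of your dataset (e.g. from `prodigy db-out`). The filename and the `"text"`/`"spans"` fields are assumptions based on Prodigy's usual format; adjust to your data. `doc.char_span` returns None whenever a span's character offsets don't land on token boundaries:

```python
import json
import spacy

nlp = spacy.blank("en")  # use the same tokenizer/pipeline you train with

# Hypothetical path to a JSONL export of your dataset (e.g. from `prodigy db-out`).
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # char_span returns None when the character offsets don't land on
            # token boundaries, i.e. the span would be dropped during training.
            if doc.char_span(span["start"], span["end"]) is None:
                snippet = eg["text"][span["start"]:span["end"]]
                print(f"Misaligned {span.get('label')!r}: {snippet!r}")
                print(f"  in: {eg['text']}")
```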

This curve looks good! This is what you want and would expect. Yes, it indicates you'll get improvements as you label more. train-curve can sometimes show some volatility, especially with only a few hundred examples. You may also find the post below helpful; it shares some ways you can customize train-curve (e.g., to add per-label performance too).

That's hard to say; it really depends on the case. A lot of the time, the score to aim for is based more on your business problem -- e.g., you need a model that hits x% for the benefits to be worth the cost of false positives. But yes, I really like Peter's points in his GitHub post.

Honestly, if you aren't constrained by some business need (e.g., you need accuracy at level X), I would instead use tools like train-curve to understand whether you're "maximizing" accuracy for your given annotation scheme (that is, how you define your spans/entities). Some spans are likely going to be easier for your model to learn than others, so you shouldn't expect all labels to converge to similar accuracies: each has a different "leveling off" point. This is why I like train-curve: once you see the curve level off, that's more evidence you've hit your maximum possible accuracy.

That's another great but tough question. I'd first highlight Prodigy's general philosophy when handling many labels:

If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.

That is, in general, try to break tasks up into the simplest form possible. So my gut would generally point more towards 30 specialized models than 1 model with 30 labels. This isn't just about model training -- e.g., a model trying to learn 30 labels may have issues, especially things like catastrophic forgetting when you retrain it -- but also about UX: it's much easier for a human to make a decision on 1 label at a time than to choose from 30 labels.

I can understand that 30 models may seem unreasonable. However, I have heard of interesting use cases for many-class classification where users created many binary classifiers. So perhaps there are opportunities to group similar labels together and land somewhere in between, say 3-5 models, each with 6-10 labels?
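
If you did go that route, combining the groups at inference time is mostly a matter of running each pipeline and merging its predictions. A minimal sketch, assuming a few hypothetical spancat pipelines saved under ./models/ and the default "sc" spans key:

```python
import spacy

# Hypothetical paths: one specialized spancat pipeline per label group.
model_paths = ["./models/group_a", "./models/group_b", "./models/group_c"]
pipelines = [spacy.load(path) for path in model_paths]

def predict_spans(text: str):
    """Run each specialized pipeline and merge its predicted spans."""
    merged = []
    for nlp in pipelines:
        doc = nlp(text)
        # "sc" is spancat's default span group key; adjust if you've changed it.
        for span in doc.spans.get("sc", []):
            merged.append((span.start_char, span.end_char, span.label_))
    return merged

print(predict_spans("Some example text to analyze."))
```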

Alternatively, is there any hierarchical structure in the labels? I really like Ines and Sofie's advice in this post (for NER, but the same general idea holds).

Overall, my main suggestion would be to invest in good tracking practices -- maybe Weights & Biases -- as well as evaluation best practices like a dedicated holdout set and training by config. Also, have you seen the spaCy projects repo? There are sample projects like this one comparing spancat vs. ner. I point to it not to suggest switching to ner, but as an example of a good spaCy project setup.
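
On the dedicated holdout set: the idea is to split once with a fixed seed, save both halves to disk, and point your config's paths.train / paths.dev at them for every run. A rough sketch with made-up example data (in practice your docs would come from your Prodigy export instead):

```python
import random
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical annotated examples; in practice these come from your Prodigy export.
annotated = [
    ("Susan went to Switzerland.", [(0, 5, "PERSON"), (14, 25, "GPE")]),
    ("Acme Corp hired Bob in 2020.", [(0, 9, "ORG"), (16, 19, "PERSON")]),
]

docs = []
for text, spans in annotated:
    doc = nlp.make_doc(text)
    # "sc" is spancat's default span group key; use doc.ents for ner instead.
    doc.spans["sc"] = [doc.char_span(start, end, label=label) for start, end, label in spans]
    docs.append(doc)

# Fixed seed so the train/dev split is reproducible: every training run and
# train-curve comparison then scores against the same dedicated holdout set.
random.seed(0)
random.shuffle(docs)
split = int(len(docs) * 0.8)
DocBin(docs=docs[:split]).to_disk("train.spacy")
DocBin(docs=docs[split:]).to_disk("dev.spacy")
```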

Hope this helps!