Do we need to relabel everything so the size of the spans is 10 max?
That shouldn't be needed. What I might do instead is write a small filtering script that creates the subset you're interested in. You can do this from Python, by the way, via the database API. It would look something like this:
```python
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("my_dataset_name")
```
From here you can make whatever subset you like from the `examples` list of annotations.
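For example, if you only want to keep examples whose spans are at most 10 tokens long, a rough sketch could look like this. It assumes your spans carry `"token_start"`/`"token_end"` offsets (which span-based recipes add), and `"my_dataset_short_spans"` is just a made-up name for the new dataset:

```python
from prodigy.components.db import connect

MAX_SPAN_TOKENS = 10

db = connect()
examples = db.get_dataset_examples("my_dataset_name")

def span_length(span):
    # Length in tokens; assumes the recipe stored "token_start"/"token_end".
    return span["token_end"] - span["token_start"] + 1

# Keep only examples where every annotated span fits within the limit.
subset = [
    eg for eg in examples
    if all(span_length(s) <= MAX_SPAN_TOKENS for s in eg.get("spans", []))
]

# Save the filtered subset as a new dataset so you can train from it directly.
db.add_dataset("my_dataset_short_spans")
db.add_examples(subset, datasets=["my_dataset_short_spans"])
```

You could also drop only the long spans instead of the whole example, depending on what you want the subset to represent.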
How do we see the errors that the model makes during training?
Ah, I didn't mean to check during training, but rather afterward.
You can take a quantitative approach by checking, on a validation set, which errors are made most frequently; in your case, you could check whether the errors are related to span length. Another option is to take a more qualitative approach and just interact with the model manually to see what kinds of texts it gets wrong.
In my experience, understanding when a model fails really helps if you're looking for inspiration on how to improve it.
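For the quantitative route, a rough sketch might look like the snippet below. The model path `"./my_model"` and the dataset name `"my_eval_dataset"` are placeholders, and it assumes your gold spans carry character and token offsets:

```python
from collections import Counter

import spacy
from prodigy.components.db import connect

nlp = spacy.load("./my_model")
db = connect()
eval_examples = db.get_dataset_examples("my_eval_dataset")

missed_by_len = Counter()  # gold spans the model did not predict, keyed by token length
total_by_len = Counter()

for eg in eval_examples:
    doc = nlp(eg["text"])
    # Use doc.spans["sc"] instead of doc.ents if you trained a span categorizer.
    predicted = {(ent.start_char, ent.end_char) for ent in doc.ents}
    for span in eg.get("spans", []):
        length = span["token_end"] - span["token_start"] + 1
        total_by_len[length] += 1
        if (span["start"], span["end"]) not in predicted:
            missed_by_len[length] += 1

for length in sorted(total_by_len):
    rate = missed_by_len[length] / total_by_len[length]
    print(f"span length {length:>2}: {rate:.1%} missed out of {total_by_len[length]} gold spans")
```

If the miss rate climbs noticeably for longer spans, that's a strong hint the span length is part of the problem.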
Only one person actually did the labeling, so I really am wondering why we are getting a very low score.
It's hard to say for sure, but it might be good to check whether the annotations are consistent. If possible, I'd try to get a second person to annotate some of these examples as well, just to see if both annotators agree. You can use the `review` recipe to help with this. If it turns out that there's disagreement between annotators, it might help to take a step back and ask whether the definitions of what needs to be annotated are clear.
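If you just want a rough agreement number before diving into the `review` UI, something like this sketch could work. The dataset names `"annotator_a"`/`"annotator_b"` are made up, and an example only counts as "agreed" when both annotators produced exactly the same spans:

```python
from prodigy.components.db import connect

db = connect()
ann_a = {eg["_input_hash"]: eg for eg in db.get_dataset_examples("annotator_a")}
ann_b = {eg["_input_hash"]: eg for eg in db.get_dataset_examples("annotator_b")}

def span_set(eg):
    # Compare spans by character offsets and label.
    return {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}

# Match examples across the two datasets via Prodigy's input hash.
shared = set(ann_a) & set(ann_b)
agree = sum(span_set(ann_a[h]) == span_set(ann_b[h]) for h in shared)
if shared:
    print(f"Exact span agreement on {len(shared)} shared examples: {agree / len(shared):.1%}")
```

Exact-match agreement is a strict measure, but it's usually enough to tell you whether the labels need a clearer definition.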
Some of the examples that you've given seem like good candidates to kickstart a discussion. Let's consider these few:
- questions about my medication
- have a question
- I just had a few questions
When I look at these examples, I immediately wonder: when do you include the verb "have/had" in the span, and when don't you? When do you include "I" in the span? Is there a clear definition for this? If not, it might explain why the model has trouble predicting some of these candidates.
This is all a bit speculative, though, and merely based on a quick impression. It's certainly possible that there's another reason for your model's performance, but it seems like a valid thing to consider.
I hope this helps. Let me know if you have follow-up questions!