How to combine results of simpler annotation runs together using spanCAT

I'm trying to figure out how to approach annotating customer comments and would appreciate you pointing me in the right direction. My goal is to classify each comment with multiple classes, whichever of them are reflected in the comment, and I believe spanCAT is the right recipe.
Roughly there are six high-level classes each comment could contain: printer setup, using the mobile app, subscription experience, ink cartridges, print quality, and customer support. Within each of those there will be sub-classes, but for now I'd be happy just to be able to classify whether the comment was positive or negative within each of those categories.

I started with one simple task, as you recommend, and annotated around 500 comments using spans.manual and spans.correct with GOOD SUBSCRIPTION and BAD SUBSCRIPTION as my labels.
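
For context, the commands were roughly along these lines (the dataset and file names are simplified here, and I've written the labels with underscores):

```bash
# first pass: manual span annotation on a blank English pipeline
prodigy spans.manual subscription_spans blank:en ./comments.jsonl \
  --label GOOD_SUBSCRIPTION,BAD_SUBSCRIPTION

# after a first `prodigy train ./output --spancat subscription_spans`,
# review the model's suggestions
prodigy spans.correct subscription_spans ./output/model-best ./comments.jsonl \
  --label GOOD_SUBSCRIPTION,BAD_SUBSCRIPTION
```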

A couple of areas where I would like advice: 1) As I was going through the comments, if a comment didn't include feedback on the subscription I skipped it. I'm thinking I'll have a chance to come back and annotate it when I add other labels, in the spirit of not trying to do it all at once. Is this a good approach? 2) If so, how do I go back through the comments and add some of the other labels? Do I simply run spanCAT again on the same dataset with new labels, and do I have to include the previous labels on each run through? Hopefully this question makes sense.

Appreciate the feedback and any corrections to my thinking or approach :slight_smile:

Some ideas that popped into my mind as I read this.

Roughly there are six high-level classes each comment could contain: printer setup, using the mobile app, subscription experience, ink cartridges, print quality, and customer support.

Is it possible that a document is about two of these classes at the same time? I can certainly imagine that a subscription experience could happen on a mobile app, just like an issue with an ink cartridge might affect the print quality.

Within each of those there will be sub-classes, but for now I'd be happy just to be able to classify whether the comment was positive or negative within each of those categories.

What about neutral? There's a lot of gray area between positive and negative. You could choose to model the "sentiment" as a spectrum where 0 means negative, 1 means positive and 0.5 represents the halfway point. But you could also model "positive" as a separate class and "negative" as another one. I'm mentioning this because in some applications you're less interested in "gray" cases and only care about the obvious positive/negative ones.

I started with one simple task, as you recommend, and annotated around 500 comments using spans.manual and spans.correct with GOOD SUBSCRIPTION and BAD SUBSCRIPTION as my labels.

Spancat could work, but I'm curious ... is there a reason why you didn't go for text classification? I can imagine that support tickets come in all sorts of shapes and sizes ... and you may care less about "where" in the text a positive sentence appears if you can already demonstrate that the overall sentiment of the text is positive. This depends a bit on the application though, so feel free to take this thought with a grain of salt.

How do I go back through the comments and add some of the other labels? Do I simply run spanCAT again on the same dataset with new labels, and do I have to include the previous labels on each run through? Hopefully this question makes sense.

There are a few ways to go about something like this. Here are some ideas that might help:

  • You can totally opt to have separate datasets for separate labels if you choose to go for non-exclusive classes. The Prodigy train recipe can then look at different datasets as it's training models too (see the sketch after this list). The nice thing about this setup is that it's fairly easy to add a new label of interest and it also keeps things separate. My gut feeling is that this approach might work very well for you.
  • Alternatively, it might also make sense to have a single interface with all the labels if you're pretty sure that the labels won't change over time. Eventually a custom recipe might make sense here. There's a nice example of this on YouTube here where I combine spancat with choice interfaces.
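
To make the first option concrete, the training call could look roughly like this (the dataset names are made up; each one would have been annotated with its own label):

```bash
prodigy train ./output \
  --spancat subscription_spans,print_quality_spans,ink_cartridge_spans \
  --eval-split 0.2
```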

Let me know if my feedback prompts any extra follow-up questions!


Thank you for the helpful feedback, Vincent! Here are some follow-up clarifications and new questions :slight_smile:
My high-level goal is to track customer comments at a regular time interval to show trends in the class counts. So maybe we see in Q1 that 40% of comments refer to 'Subscriptions' and that 90% of those have negative sentiment (BAD SUBSCRIPTION counts). In a dashboard we might be able to model the impact of underlying changes to the subscription experience and see the BAD SUBSCRIPTION counts reverse course toward GOOD SUBSCRIPTION, say, in Q2. I'm assuming that the class categories will be fairly constant, at least over the coming year or two.

Your question:
Is it possible that a document is about two of these classes at the same time? I can certainly imagine that a subscription experience could happen on a mobile app, just like an issue with an ink cartridge might affect the print quality.

This is true. If we get 5000 comments, it could be that more than half refer to subscriptions and have bad sentiment, but within the same comment a customer might also mention that printer setup was easy and support was good. That's why I thought spanCAT would be the right approach - to tease out spans referring to more than one class in the same comment.

Your comment:
What about neutral? There's a lot of gray area between positive and negative. You could choose to model the "sentiment" as a spectrum where 0 means negative, 1 means positive and 0.5 represents the halfway point. But you could also model "positive" as a separate class and "negative" as another one. I'm mentioning this because in some applications you're less interested in "gray" cases and only care about the obvious positive/negative ones.

I want to be able to tag the sentiment to a specific span, not the whole document, to allow for comments that have something nice to say about one thing and are angry about another.

What's interesting is that in addition to the comment, the customer has given their NPS score, which, if you're familiar with it, ends up being a useful sentiment class label: customers fall into PROMOTER, PASSIVE, or DETRACTOR by their own submission. Interestingly, promoters' comments are generally all positive and detractors' are mostly negative, but passives have more conflicting sentiments, and the transformer-based models I've applied have less than 0.5 confidence predicting those comments. I'm thinking spanCAT will help tease that out.

Your comment:
You can totally opt to have separate datasets for separate labels if you choose to go for non-exclusive classes. The Prodigy train recipe can then look at different datasets as it's training models too. The nice thing about this setup is that it's fairly easy to add a new label of interest and it also keeps things separate. My gut feeling is that this approach might work very well for you.

This is really helpful, but I'm trying to form a mental model for how the model works with a series of datasets to make a classification prediction! As I mentioned, on the first dataset run I'm skipping comments that don't mention the subscription. If I start a new run with labels for another class, then I'll have comments annotated for that class plus skipped comments. After 6 runs and 6 separate datasets, do I end up with a model that integrates results for a comment across classes?

Your comment:
Alternatively, it might also make sense to have a single interface with all the labels if you're pretty sure that the labels won't change over time. Eventually a custom recipe might make sense here. There's a nice example of this on YouTube here where I combine spancat with choice interfaces.

Thanks for this video - I'm thinking that having all the labels present from the start might be the best. Based on your thoughts on the separate annotation runs and the final model output I'll take a look at this more.

Thank you Vincent - appreciate your input and I'm learning so much watching your videos :slight_smile:

There are many ways to make a machine learning pipeline. But if you use the standard train command from Prodigy then you'll be able to get a single spaCy pipeline at the end that tries to predict every label in the dataset. To put it in diagram form: (sketch: several annotation datasets feeding into one prodigy train run, which produces a single spaCy pipeline)
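
In code terms: once training finishes you load one pipeline, and its single spancat component carries every label it saw across the datasets (the output path below is just whatever you passed to prodigy train):

```python
import spacy

# assuming something like: prodigy train ./output --spancat dataset_a,dataset_b
nlp = spacy.load("./output/model-best")

# one spancat component, containing all labels from all the datasets it was trained on
print(nlp.get_pipe("spancat").labels)
```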

Note that you're also free to use your own Python code with your own machine learning model and script the training loop yourself. So if you have extra needs, you can totally go beyond the base settings that the prodigy train command offers.
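
Just to illustrate that last point: a very bare-bones, hand-rolled training loop with spaCy could look something like the sketch below. The texts, labels and character offsets are made up, and in practice you'd want proper train/dev splits, batching and a config, so treat this as a sketch rather than a recommendation.

```python
import random
import spacy
from spacy.training import Example

# Toy training data: (text, [(start_char, end_char, label), ...]).
TRAIN_DATA = [
    ("The subscription keeps double-billing me.", [(4, 16, "BAD_SUBSCRIPTION")]),
    ("Setting up the printer was painless.", [(15, 22, "GOOD_SETUP")]),
]

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")  # predictions go into doc.spans["sc"] by default

examples = []
for text, annotations in TRAIN_DATA:
    pred = nlp.make_doc(text)
    ref = nlp.make_doc(text)
    ref.spans["sc"] = [
        ref.char_span(start, end, label=label) for start, end, label in annotations
    ]
    examples.append(Example(pred, ref))

# initialize() infers the labels from the examples and returns an optimizer
optimizer = nlp.initialize(get_examples=lambda: examples)

for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
    print(epoch, losses["spancat"])
```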

That's super helpful, thank you! Your sketches are always awesome. I currently have 1250 customer comments annotated using the single-dataset approach with 14 class labels. Using spans.manual, train, and spans.correct I'm getting an F-score of 0.37, so I still have quite a bit more annotating to do. There's quite an imbalance in the frequency of each class label as well. One begins to wonder if the data needs to be pre-processed in some clever ways to reduce noise - the comments are all over the place. As a beginner at this, can I ask: if I train a model using spancat and evaluate a new text input, does it output the predicted spans and their associated probabilities? Is there an example of this for text spans as opposed to NER?

This F-score is an average of the scores across your labels. Are there certain classes that perform much better than others? It could be good to focus on the well-performing labels: if the business case allows for it, you might get away with just having a few labels that perform very well. Or you could focus on the worst-performing labels and try to get more training data for those specifically.
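
If it helps, one way to get the per-label numbers (the dataset name and paths here are just examples, and this assumes the default span key "sc") is to export your annotations and run spaCy's evaluate command:

```bash
# export the Prodigy annotations into spaCy's binary format, with a held-out split
prodigy data-to-spacy ./corpus --spancat comments_spans --eval-split 0.2

# evaluate the trained pipeline and write the metrics to a JSON file
python -m spacy evaluate ./output/model-best ./corpus/dev.spacy --output metrics.json
```

The resulting metrics.json should contain a per-type section (typically under a key like spans_sc_per_type) with precision, recall and F for every label, which makes it easier to spot the weak ones.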

Clever pre-processing usually doesn't help much, no. You could try to use your trained pipeline to make predictions and then only select a relevant subset. If the model predicts a rare label in your unlabelled dataset, you might want to annotate those examples first. I usually write the logic for this in a Jupyter notebook that writes a small .jsonl file with the examples as output.
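
As a sketch of that notebook logic (the file names, the label and the 0.5 cut-off are all placeholders), this also shows where the predicted spans and their scores end up, which touches on your earlier question:

```python
import spacy
import srsly

nlp = spacy.load("./output/model-best")

RARE_LABELS = {"INK_CARTRIDGES"}   # hypothetical label you want more examples of

selected = []
for example in srsly.read_jsonl("./unlabelled_comments.jsonl"):
    doc = nlp(example["text"])
    spans = doc.spans["sc"]                    # default spancat span key
    scores = spans.attrs.get("scores", [])     # spancat stores one score per predicted span here
    for span, score in zip(spans, scores):
        if span.label_ in RARE_LABELS and score >= 0.5:
            selected.append(example)
            break

# write a small .jsonl that you can feed back into spans.correct / spans.manual
srsly.write_jsonl("./rare_label_candidates.jsonl", selected)
```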

Does this help?