Ignored sentences for text classification

Hi, I’m using Prodigy to do text classification. I learnt from other posts that ignored sentences will not be used at all in either the training or the evaluation process. My understanding is that I need to minimize the number of ignored sentences in the training/evaluation sets to speed up the annotation process. However, my blind set will have all kinds of sentences. If I want to make the training/evaluation sets as close as possible to the blind set, I need to include some ignored sentences in the training/testing sets as well, right? I’m a little confused here. Can you please advise?

Hi! Just to make sure I understand the question correctly: By “ignore”, you actually mean the grey ignore button, right? And if so, is there a reason you’re ignoring the sentences instead of rejecting them?

In general, the “ignore” button is mostly intended for very specific examples that you want to skip. For instance, if you’re annotating comments scraped from the web, you might end up with broken markup or one sentence in a different language. Sometimes you also have examples that are confusing and difficult, so instead of spending a long time thinking about it, it’s better to just ignore and move on.

But of course, by that logic, if you choose to ignore examples based on some objective criteria (broken markup, wrong language), you also shouldn’t be evaluating against a set that contains those types of examples. Or, put differently, if your runtime model needs to be able to handle certain types of texts, those should be present in both your training and your evaluation set.

Yes, by “ignore” I mean the grey ignore button. Thanks for your clarification on it, that helps a lot! So my next question is: the ratio of accept to reject in my blind set is pretty low (maybe around 20-30% or even lower). But when I prepare my training and testing sets for annotation, should I make the accept/reject ratio around 0.5?

My blind set is about 20M sentences, and I’m selecting 20K to annotate. I will try to include all possible sentence types (different levels of subjectivity/polarity score) in the training and testing sets. Does that make sense? If I randomly select 20K, it might still not cover all kinds of sentences.

Thanks,

I think if your class balance is around 80:20, it’s not worth trying to rebalance the data, because the model you learn will have a bias that’s different from the evaluation data.

If the classes are extremely imbalanced (like 95:5 or worse), it can be worth trying to sample more of the rare class into your annotations, using some search strings or something. This is especially the case if what you care about is actually not overall accuracy, but something like having high recall over the rare class. It’s very common to have that sort of different interest in the accuracy if one of the classes is rare.

In summary, I would first try the simple thing, and not worry about the imbalanced classes. If you find you have extremely imbalanced classes, you can look at solutions, but these tend to be quite problem specific.

Thanks for your insights, that’s really helpful! Now I have another issue. I used “prodigy sentiment xxx xxx.jsonl -F recipe.py” to do text classification with 3 categories. How do I train the model afterwards? textcat.batch-train did not work. Should I restructure the annotation JSONL file first?

Yes, the textcat.batch-train recipe expects binary annotations (i.e. accept/reject), whereas your recipe creates multiple choice annotations where each example has an "accept" key with the selected labels. If you export your data using prodigy db-out, you’ll see the format.
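
For illustration, a multiple choice example in the exported JSONL will look roughly like this (the values here are made up, and Prodigy adds a few more keys like the hashes):

{
    "text": "I can't believe this happened!",
    "options": [{"id": "HAPPY", "text": "HAPPY"},
                {"id": "SAD", "text": "SAD"},
                {"id": "ANGRY", "text": "ANGRY"}],
    "accept": ["ANGRY"],
    "answer": "accept"
}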

So you pretty much have 2 options – one is to just use spaCy directly to train a text classifier. You can call nlp.update with a text and a dictionary of annotations, for example: {'cats': {'HAPPY': True, 'SAD': False}}. You can generate this format from your Prodigy annotations and then write your own little training script.
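
Here’s a rough sketch of what such a script could look like, assuming a spaCy v2-style training loop (the example texts, label names and output path below are just placeholders):

import random
import spacy

nlp = spacy.blank('en')
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat)
for label in ('HAPPY', 'SAD', 'ANGRY'):  # whatever your labels are
    textcat.add_label(label)

# (text, annotations) pairs generated from your Prodigy annotations
train_data = [
    ("I love this!", {'cats': {'HAPPY': True, 'SAD': False, 'ANGRY': False}}),
    ("This is so frustrating.", {'cats': {'HAPPY': False, 'SAD': False, 'ANGRY': True}}),
]

optimizer = nlp.begin_training()
for i in range(10):  # number of training iterations
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(i, losses)

nlp.to_disk('/path/to/output/model')  # placeholder path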

The other option would be to convert it to the binary format, import it to a new dataset and then use textcat.batch-train. Just keep in mind that you also need negative examples – so you probably want to create one example per text per label, mark the correct label(s) as "answer": "accept" and all others as "answer": "reject". Otherwise, your model may learn that “every example is accepted!”, which gives you a great accuracy of 100% – but a really useless classifier :stuck_out_tongue:

Here’s an example of a conversion script (an updated version of the one in this thread):

from prodigy.components.db import connect
from prodigy.util import set_hashes
import copy

db = connect()
examples = db.get_dataset('your_dataset_name')
labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are

converted_examples = []  # export this later

for eg in examples:
    if eg['answer'] != 'accept':
        continue  # skip if the multiple choice example wasn't accepted
    for label in labels:
        new_eg = copy.deepcopy(eg)  # copy the example
        new_eg['label'] = label  # add the label
        # If this label is in the selected options, mark this example
        # (text plus label) as accepted. Otherwise, mark it as rejected,
        # i.e. the label doesn't apply
        new_eg['answer'] = 'accept' if label in eg['accept'] else 'reject'
        # not 100% sure if necessary but set new hashes just in case
        new_eg = set_hashes(new_eg, overwrite=True)
        converted_examples.append(new_eg)
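
Once you have the converted examples, one way to get them into a new dataset is to write them out as JSONL and import that file with db-in – for example, continuing the script above (the file and dataset names are placeholders):

import json

with open('converted_binary.jsonl', 'w', encoding='utf8') as f:
    for converted_eg in converted_examples:
        f.write(json.dumps(converted_eg) + '\n')

After that, running prodigy db-in binary_sentiment converted_binary.jsonl and then textcat.batch-train on the binary_sentiment dataset should let you train as usual.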

Thx Ines, your guidance is very helpful! I’m able to train a text classification model successfully on the binary annotations. I’m getting 3 probability numbers for a sentence being in each of the three categories. Should the 3 probability numbers add up to 1.0?

Another question: after I create the model that gives predictions on all three labels together, I can get the overall precision and recall. But is there a way to get precision and recall for an individual label (e.g. ‘HAPPY’, ‘SAD’, ‘ANGRY’)?

Unfortunately the textcat evaluation doesn’t return that information currently. You should be able to implement it yourself if you run the model over the data, but we’ll definitely consider implementing this in future, as a couple of people have asked for it.
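
If you want to compute it yourself in the meantime, here’s a rough sketch, assuming you load your trained spaCy text classifier and have a held-out set of (text, gold labels) pairs – the model path, label names and threshold are placeholders:

import spacy

nlp = spacy.load('/path/to/your/textcat/model')  # placeholder path
labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are
threshold = 0.5  # score above which a label counts as predicted

def per_label_scores(eval_data):
    # eval_data: list of (text, {'HAPPY': True, 'SAD': False, ...}) pairs
    counts = {label: {'tp': 0, 'fp': 0, 'fn': 0} for label in labels}
    for text, gold in eval_data:
        doc = nlp(text)
        for label in labels:
            predicted = doc.cats.get(label, 0.0) >= threshold
            if predicted and gold.get(label):
                counts[label]['tp'] += 1
            elif predicted and not gold.get(label):
                counts[label]['fp'] += 1
            elif not predicted and gold.get(label):
                counts[label]['fn'] += 1
    scores = {}
    for label, c in counts.items():
        precision = c['tp'] / (c['tp'] + c['fp']) if (c['tp'] + c['fp']) else 0.0
        recall = c['tp'] / (c['tp'] + c['fn']) if (c['tp'] + c['fn']) else 0.0
        scores[label] = {'precision': precision, 'recall': recall}
    return scores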

What about the "rejected" annotations? Are they useful for training and evaluation? In other words, should they be added to training.jsonl and evaluation.jsonl?

This depends on the type of annotations – but typically, yes. If your annotations are binary, "reject" means that the label doesn't apply, so this is obviously relevant and the equivalent of {"LABEL": False}.

If your annotations are multiple choice annotations, Prodigy currently "inverts" the selection and treats all unselected labels as missing if labels aren't mutually exclusive. So if you select LABEL1 and then reject the example, it results in {"LABEL1": False, "LABEL2": None}. But you can choose to handle this differently and just ignore all rejected examples in your data.
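
If you go for the “ignore all rejected examples” approach and train with spaCy directly, the conversion can be as simple as something like this (a sketch – the label names are placeholders, and unselected labels are treated as False here rather than missing):

labels = ('HAPPY', 'SAD', 'ANGRY')  # whatever your labels are

def choice_to_cats(examples):
    train_data = []
    for eg in examples:
        if eg.get('answer') != 'accept':
            continue  # drop rejected (and ignored) examples entirely
        selected = set(eg.get('accept', []))
        # all unselected labels become False here, not missing
        cats = {label: label in selected for label in labels}
        train_data.append((eg['text'], {'cats': cats}))
    return train_data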