How can I improve a textcat model?

I have a textcat model that I’m running against real data. I’m pretty happy with the results, but I also see a lot of examples where the model finds a category that isn’t there. I’d like to feed these examples back into the model as rejections, but I’m not entirely sure how. Is it safe to make a set of just rejections, or will that introduce a problem of catastrophic forgetting? Do I need to mix it in with accepts that the model has already seen, or generate new ones?

Hi! You definitely want to prevent your model from overfitting to the new rejections and learning something like “Category X is always reject and never applies anymore” (which would be the other extreme). One solution is to mix the rejected examples in with your existing training data and then retrain the model from scratch.
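As a rough sketch of that mixing step: this assumes your annotations are dicts with "text", "label" and "answer" keys, which is how Prodigy stores text classification answers. The helper name `mix_examples` and the sample texts are made up for illustration.

```python
import random


def mix_examples(original, rejects, seed=0):
    """Combine earlier annotations with new rejects, dropping exact duplicates."""
    seen = set()
    mixed = []
    for eg in list(original) + list(rejects):
        key = (eg["text"], eg["label"], eg["answer"])
        if key not in seen:
            seen.add(key)
            mixed.append(eg)
    # Shuffle so the new rejects are not all clustered at the end
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical data: annotations the model was trained on before...
original = [
    {"text": "Great phone", "label": "POSITIVE", "answer": "accept"},
    {"text": "Terrible battery", "label": "POSITIVE", "answer": "reject"},
]
# ...plus the false positives you collected from real data
new_rejects = [
    {"text": "It arrived on Tuesday", "label": "POSITIVE", "answer": "reject"},
]
train_data = mix_examples(original, new_rejects)
```

You would then retrain from scratch on `train_data`, so the model sees the old accepts and the new rejects together.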

You could also create a simple Prodigy recipe that streams in your texts, processes them with your model and sends out examples and labels with a high score (e.g. >= 0.5 or whatever else you consider high enough). Here’s an example:

import copy

import spacy
from prodigy.components.loaders import JSONL


def get_stream(source_file, model_path):
    stream = JSONL(source_file)  # Load the raw texts
    nlp = spacy.load(model_path)
    for eg in stream:
        doc = nlp(eg["text"])  # Process text with your model
        for cat, score in doc.cats.items():
            if score >= 0.5:  # Or any other threshold
                new_eg = copy.deepcopy(eg)  # Copy so each label gets its own task
                new_eg["label"] = cat
                yield new_eg

Maybe you also want to add other custom logic here – for instance, focus on specific labels only, or adjust the score threshold per label (if you want to double-check more predictions for a given problematic label).
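For instance, the per-label logic could look roughly like this. The label names, the threshold values and the `select_labels` helper are all hypothetical; `cats` stands in for the `doc.cats` dict of label scores.

```python
DEFAULT_THRESHOLD = 0.5
# Hypothetical: use a lower bar for a label you want to double-check more often
LABEL_THRESHOLDS = {"SPAM": 0.3}


def select_labels(cats, thresholds=LABEL_THRESHOLDS, default=DEFAULT_THRESHOLD, only=None):
    """Return the labels whose score meets their own threshold.

    If `only` is given, every label outside that set is ignored.
    """
    selected = []
    for label, score in cats.items():
        if only is not None and label not in only:
            continue
        if score >= thresholds.get(label, default):
            selected.append(label)
    return selected


cats = {"SPAM": 0.35, "BILLING": 0.45, "SUPPORT": 0.8}
all_selected = select_labels(cats)
focused = select_labels(cats, only={"SUPPORT", "BILLING"})
```

In the stream function, you would loop over the returned labels instead of comparing every score against a fixed 0.5.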

Finally, you could also give the textcat.teach recipe a try, which can help you with the example selection. What it does is quite similar to the code I posted above, but it also updates the model in the loop as you annotate, focuses on uncertain predictions (e.g. those closest to 0.5) and uses an exponential moving average to track the scores it sends out, so you never get stuck.
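To give a feel for the idea, here is a minimal sketch of uncertainty-based selection with an exponential moving average. This is not Prodigy's actual implementation: the function name, the `bias` and `slack` parameters and the exact rule are assumptions made for illustration.

```python
def prefer_uncertain_sketch(scored_stream, bias=0.8, slack=0.05):
    """Yield examples that are about as uncertain as the recent average, or more.

    scored_stream yields (score, example) tuples with scores in [0, 1].
    """
    ema = 0.0  # Moving average of the uncertainty seen so far
    for score, eg in scored_stream:
        # 1.0 when the score is exactly 0.5, 0.0 when it is 0 or 1
        uncertainty = 1.0 - 2.0 * abs(score - 0.5)
        if uncertainty >= ema - slack:
            yield eg
        # The average follows the stream, so a run of confident
        # predictions lowers the bar instead of blocking forever
        ema = bias * ema + (1.0 - bias) * uncertainty


scored = [
    (0.5, {"text": "a"}),
    (1.0, {"text": "b"}),
    (0.75, {"text": "c"}),
    (0.875, {"text": "d"}),
    (0.875, {"text": "e"}),
    (0.875, {"text": "f"}),
]
selected = [eg["text"] for eg in prefer_uncertain_sketch(scored, bias=0.5, slack=0.1)]
```

Because the bar is relative to what the model has recently produced rather than a fixed cutoff, the stream keeps sending examples even when most predictions are fairly confident.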