Hi! I understand that textcat.teach can use pattern-matching to bootstrap the labeling of (rare) classes in text classification tasks, but I would like to know your thoughts about using zero-shot classifiers (e.g. HuggingFace transformers pipelines).
In an ideal workflow, I'd like a zero-shot classifier to replace the pattern matcher, allowing me to quickly accept/reject the labels it assigns. By entering more and more annotations, I would (1) measure the accuracy of my zero-shot baseline and (2) start training another model with the active-learning logic (hopefully outperforming the zero-shot baseline).
Would this make sense as an enhancement for Prodigy?
What is the best approximation of the workflow above with current tools? I was thinking about:

1. running inference with a zero-shot classifier
2. reviewing the labels one by one manually for a subset of examples to create a gold set
3. training a spaCy model on that set
4. using textcat.teach with that model to refine it further

but was wondering if there are more elegant/less cumbersome ways. Thanks!
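Step 1 of the plan above can be sketched as a small helper that turns zero-shot predictions into accept/reject tasks for review. Everything here is my own naming; `classify` stands in for any callable with the HuggingFace zero-shot output shape (a dict with parallel `"labels"` and `"scores"` lists, best label first):

```python
def zero_shot_tasks(texts, candidate_labels, classify):
    """Run a zero-shot classifier over raw texts and build one
    accept/reject task per text from the top-scoring label."""
    tasks = []
    for text in texts:
        result = classify(text, candidate_labels)
        top_label = result["labels"][0]
        top_score = result["scores"][0]
        # Prodigy-style task dict: the annotator just accepts or rejects it
        tasks.append({"text": text, "label": top_label,
                      "meta": {"score": top_score}})
    return tasks

# With transformers installed, `classify` would be the real pipeline:
#   from transformers import pipeline
#   classify = pipeline("zero-shot-classification")
```

The accepted tasks from this review pass would then form the gold set for step 3.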
This workflow definitely makes sense, and it's one I'd like to have an example recipe for. I would recommend just starting a new recipe for yourself. The logic should be quite simple, and it will allow you to express that logic directly without having to worry about how we've written other components. If you do write this, please do share the results.
Hi @honnibal, thanks! I gave it a (zeroth...) shot here:
I plug in the classifier only for scoring and selecting examples (covering points 1 and 2 of my rough plan above). I assume that updating the underlying model is much trickier, so I'll leave that aside for now.
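The scoring-and-selecting part can be sketched like this: attach an uncertainty score to each example and let a sorter surface the least confident ones first. The margin-based measure below is my own choice (any uncertainty measure would do), and the real Prodigy sorter is only referenced in a comment:

```python
def uncertainty(scores):
    """Margin-based uncertainty: 1.0 when the top two label scores tie,
    close to 0.0 when one label clearly dominates."""
    ranked = sorted(scores, reverse=True)
    return 1.0 - (ranked[0] - ranked[1])

def score_stream(stream, candidate_labels, classify):
    """Yield (score, example) tuples, the shape Prodigy sorters expect."""
    for eg in stream:
        result = classify(eg["text"], candidate_labels)
        eg["label"] = result["labels"][0]
        yield uncertainty(result["scores"]), eg

# Inside a custom recipe, the scored stream would then be sorted with:
#   from prodigy.components.sorters import prefer_uncertain
#   stream = prefer_uncertain(score_stream(stream, labels, classify))
```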
I haven't used it seriously yet, and inference is slow enough that it slows Prodigy down (maybe the streaming could be improved to run inference continuously).
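One way to ease that bottleneck is to feed the classifier batches instead of single texts, since HF pipelines also accept lists of inputs. A sketch, where `classify_batch` stands in for calling the pipeline on a list and the batch size of 16 is an arbitrary choice:

```python
from itertools import islice

def batched_stream(stream, candidate_labels, classify_batch, batch_size=16):
    """Pull examples from the stream in batches, classify each batch in one
    call, and yield labeled examples one by one (keeping the stream lazy)."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        results = classify_batch([eg["text"] for eg in batch], candidate_labels)
        for eg, result in zip(batch, results):
            eg["label"] = result["labels"][0]
            eg["meta"] = {"score": result["scores"][0]}
            yield eg
```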
Another improvement could come from running inference on an extended/verbose version of the labels (as opposed to the labels themselves), to better capture the semantics of each class.
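That idea can be sketched as classifying against descriptive phrases and mapping the winning phrase back to the short label used for annotation. The descriptions below are invented examples; note the HF zero-shot pipeline's `hypothesis_template` parameter (default `"This example is {}."`) can also be customized to a similar effect:

```python
# Hypothetical mapping from verbose descriptions to the short labels
# that actually go into the annotation tasks.
VERBOSE_LABELS = {
    "a positive review of a product": "POSITIVE",
    "a negative review of a product": "NEGATIVE",
}

def classify_verbose(text, classify, verbose_labels=VERBOSE_LABELS):
    """Classify against the descriptive phrases, then return the short
    label (and score) of the best-matching description."""
    result = classify(text, list(verbose_labels))
    best_description = result["labels"][0]
    return verbose_labels[best_description], result["scores"][0]
```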
Still, even this current version could outperform pattern matching if the labels are expressive enough, and there's no need to come up with patterns.