Hi! I understand that textcat.teach can use pattern-matching to bootstrap the labeling of (rare) classes in text classification tasks, but I would like to know your thoughts about using zero-shot classifiers (e.g. HuggingFace transformers pipelines).
In an ideal workflow, I'd like a zero-shot classifier to replace the pattern matcher, allowing me to quickly accept/reject the labels it assigns. By entering more and more annotations, I would (1) measure the accuracy of my zero-shot baseline and (2) start training another model with the active-learning logic (hopefully outperforming the zero-shot baseline).
Would this make sense as an enhancement for Prodigy?
What is the best approximation of the workflow above with current tools? I was thinking about:

1. running inference with a zero-shot classifier
2. reviewing the labels one by one manually for a subset of examples to create a gold set
3. training a spaCy model on that set
4. using textcat.teach with that model to refine it further

but was wondering if there are more elegant/less cumbersome ways. Thanks!
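Step 1 of the plan above can be sketched as a small helper that turns zero-shot predictions into accept/reject tasks for review. Everything here is my own naming; `classify` stands in for any callable with the HuggingFace zero-shot output shape (a dict with parallel `"labels"` and `"scores"` lists, best label first):

```python
def zero_shot_tasks(texts, candidate_labels, classify):
    """Run a zero-shot classifier over raw texts and build one
    accept/reject task per text from the top-scoring label."""
    tasks = []
    for text in texts:
        result = classify(text, candidate_labels)
        top_label = result["labels"][0]
        top_score = result["scores"][0]
        # Prodigy-style task dict: the annotator just accepts or rejects it
        tasks.append({"text": text, "label": top_label,
                      "meta": {"score": top_score}})
    return tasks

# With transformers installed, `classify` would be the real pipeline:
#   from transformers import pipeline
#   classify = pipeline("zero-shot-classification")
```

The accepted tasks from this review pass would then form the gold set for step 3.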
This workflow definitely makes sense, and it's one I'd like to have an example recipe for. I would recommend just starting a new recipe for yourself. The logic should be quite simple, and it will allow you to express that logic directly without having to worry about how we've written other components. If you do write this, please do share the results.
Hi @honnibal, thanks! I gave it a (zeroth...) shot here:
I plug in the classifier only for scoring and selecting examples (covering points 1 and 2 of my rough plan above). I assume that updating the underlying model is much trickier, so I'll leave that aside for now.
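The scoring-and-selecting part can be sketched like this: attach an uncertainty score to each example and let a sorter surface the least confident ones first. The margin-based measure below is my own choice (any uncertainty measure would do), and the real Prodigy sorter is only referenced in a comment:

```python
def uncertainty(scores):
    """Margin-based uncertainty: 1.0 when the top two label scores tie,
    close to 0.0 when one label clearly dominates."""
    ranked = sorted(scores, reverse=True)
    return 1.0 - (ranked[0] - ranked[1])

def score_stream(stream, candidate_labels, classify):
    """Yield (score, example) tuples, the shape Prodigy sorters expect."""
    for eg in stream:
        result = classify(eg["text"], candidate_labels)
        eg["label"] = result["labels"][0]
        yield uncertainty(result["scores"]), eg

# Inside a custom recipe, the scored stream would then be sorted with:
#   from prodigy.components.sorters import prefer_uncertain
#   stream = prefer_uncertain(score_stream(stream, labels, classify))
```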
I haven't used it seriously yet, and inference is slow enough that it slows Prodigy down (maybe the streaming could be improved to run inference continuously).
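One way to ease that bottleneck is to feed the classifier batches instead of single texts, since HF pipelines also accept lists of inputs. A sketch, where `classify_batch` stands in for calling the pipeline on a list and the batch size of 16 is an arbitrary choice:

```python
from itertools import islice

def batched_stream(stream, candidate_labels, classify_batch, batch_size=16):
    """Pull examples from the stream in batches, classify each batch in one
    call, and yield labeled examples one by one (keeping the stream lazy)."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        results = classify_batch([eg["text"] for eg in batch], candidate_labels)
        for eg, result in zip(batch, results):
            eg["label"] = result["labels"][0]
            eg["meta"] = {"score": result["scores"][0]}
            yield eg
```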
Another improvement could come from running inference on an extended/verbose version of the labels (as opposed to the labels themselves), to better capture the semantics of each class.
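That idea can be sketched as classifying against descriptive phrases and mapping the winning phrase back to the short label used for annotation. The descriptions below are invented examples; note the HF zero-shot pipeline's `hypothesis_template` parameter (default `"This example is {}."`) can also be customized to a similar effect:

```python
# Hypothetical mapping from verbose descriptions to the short labels
# that actually go into the annotation tasks.
VERBOSE_LABELS = {
    "a positive review of a product": "POSITIVE",
    "a negative review of a product": "NEGATIVE",
}

def classify_verbose(text, classify, verbose_labels=VERBOSE_LABELS):
    """Classify against the descriptive phrases, then return the short
    label (and score) of the best-matching description."""
    result = classify(text, list(verbose_labels))
    best_description = result["labels"][0]
    return verbose_labels[best_description], result["scores"][0]
```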
Still, even this current version could outperform pattern matching if the labels are expressive enough, and there's no need to come up with patterns.