Any ideas for optimizing LLM annotation on unbalanced datasets?

My questions:

1. With LLM annotation or textcat.llm.fetch, is it possible to filter which examples get sent to the LLM, so I don't spend money on labels that are already well trained?
2. Any ideas for annotating with SetFit or another local few-shot model?

I have a textcat dataset with 100,000 examples and 34 labels.
Four of these labels need more examples.

To speed things up, I pre-annotated these four categories with patterns and used a custom recipe to show only those labels for human review.

Using a model as an annotator could also speed up my model fine-tuning.
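For reference, here is roughly how I check which labels are underrepresented. This is just a sketch: the "accept" field matches the format Prodigy's choice-based textcat exports use, but adjust it to your own schema.

```python
import json
from collections import Counter

def label_counts(jsonl_path):
    """Count label occurrences in a Prodigy-style JSONL export."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # "accept" holds the chosen labels in choice-style textcat tasks
            counts.update(example.get("accept", []))
    return counts
```

The labels at the bottom of the resulting counter are the ones that need more examples.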


I'll respond to your questions separately below.

  1. The textcat.llm.fetch recipe needs an examples.jsonl file to send to an LLM provider. Nothing prevents you from doing some analysis yourself so that examples.jsonl contains only the subset of examples you actually want to send. In your case, you might remove the examples that the model is already confident about.
  2. Funny you mention SetFit. It's one of the models that I may add to the Prodigy-HuggingFace plugin. I have no timeline for this, but theoretically there's no reason why a SetFit model can't be used as a model in the loop. That said, you can also train a Hugging Face or spaCy model to help you with that.
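To make point 1 concrete, the pre-filtering step could look something like the sketch below. The score_fn argument is a placeholder for whatever confidence your own model produces, and the uncertainty band (0.35-0.65) is an arbitrary assumption you'd want to tune.

```python
import json

def filter_uncertain(in_path, out_path, score_fn, low=0.35, high=0.65):
    """Write only the examples whose model score falls inside an
    'uncertain' band, so confident predictions are never sent to the LLM."""
    kept = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            example = json.loads(line)
            score = score_fn(example["text"])  # your model's confidence
            if low <= score <= high:
                fout.write(json.dumps(example) + "\n")
                kept += 1
    return kept
```

You'd run this over your full export first, then point textcat.llm.fetch at the filtered file, which keeps the LLM spend proportional to the examples you actually need help with.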

Let me know if this helps. If you'd appreciate more advice, could you share some more context about your task? Which four labels need more data? Are the labels mutually exclusive?

Thanks @koaning (Vincent)!
I'll play with the Hugging Face plugin and will report back on the process.

About the first question: is there an example of a custom recipe for annotation? My idea is to extend the openai.fetch recipe.


The llm recipes are in our internal repo for now, but this does serve as a nice reminder that we should port them to our recipes repository. I've added an internal ticket and will let you know once it's been taken care of!