Any ideas for optimizing LLM annotations on unbalanced datasets?

The questions:
1. With the LLM recipes (e.g. textcat.llm.fetch), is it possible to filter which examples get sent to the LLM, so that I don't spend money on labels that are already well trained?
2. Any ideas for annotating with a model in the loop, using SetFit or other local few-shot models?

Context
I have a textcat dataset with 100,000 examples and 34 labels.
Four of these labels need more examples.

I pre-annotated these 4 categories with patterns to speed things up, and filtered with a custom recipe so that only these labels are displayed for human review.
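For context, the patterns are plain Matcher-style entries, roughly like this (labels and keywords made up for illustration):

```json
{"label": "LABEL_A", "pattern": [{"lower": "refund"}]}
{"label": "LABEL_B", "pattern": [{"lower": "delivery"}, {"lower": "delay"}]}
```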

Using a model as an annotator could also speed up my model fine-tuning.

Thanks

I'll respond to your questions separately below.

  1. The textcat.llm.fetch recipe needs an examples.jsonl file to send to an LLM provider. Nothing is preventing you from doing some analysis yourself so that this examples.jsonl file is a subset of interest, containing only the examples you'd want to send. In your case, you could remove the examples that the model is already confident about; see the sketch after this list.
  2. Funny you mention SetFit. It's one of the models that I may add to the Prodigy-Huggingface plugin. I have no timeline for this, but theoretically there's no reason why a SetFit model can't be used as a model in the loop; a rough example follows below. That said, you can also train a Hugging Face model or a spaCy model to help you with that too.
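For the first point, here's a minimal sketch of what that filtering could look like. It assumes you already have a trained spaCy textcat pipeline at `./my_textcat_model`, and the paths, labels, and confidence band are all placeholders to adapt to your data:

```python
import json

import spacy

# Hypothetical paths and labels: adjust to your own pipeline and data.
nlp = spacy.load("./my_textcat_model")
RARE_LABELS = ["LABEL_A", "LABEL_B", "LABEL_C", "LABEL_D"]  # your 4 rare labels

with open("examples.jsonl") as f_in, open("examples_subset.jsonl", "w") as f_out:
    for line in f_in:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        # Keep only the examples the model is unsure about for the rare labels,
        # so the LLM budget is spent where it helps the most.
        if any(0.25 < doc.cats.get(label, 0.0) < 0.75 for label in RARE_LABELS):
            f_out.write(json.dumps(eg) + "\n")
```

The resulting examples_subset.jsonl can then be passed to textcat.llm.fetch as usual.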
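For the second point, a rough few-shot sketch using SetFit's classic SetFitTrainer API (newer SetFit versions use Trainer/TrainingArguments instead; the model name and training examples here are placeholders):

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# A handful of labelled examples per class is often enough for SetFit.
train_ds = Dataset.from_dict({
    "text": ["text about label a", "text about label b"],  # placeholder examples
    "label": [0, 1],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # number of contrastive text pairs to generate
)
trainer.train()

# Use the fitted model to pre-annotate unlabelled texts for human review.
preds = model.predict(["some new text to pre-annotate"])
```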

Let me know if this helps. If you'd appreciate more advice, could you share some more context about your task? Which four labels need more data? Are the labels mutually exclusive?

Sounds good, @koaning (Vincent)!
I'll play with the HuggingFace plugin and report back on the process.

About the first question: is there an example of a custom recipe for annotations? My idea is to extend the openai.fetch recipe.

thanks

The llm recipes are in our internal repo for now, but this does serve as a nice reminder that we should port them to our recipes repository. I've added an internal ticket and will let you know once it's been taken care of!
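In the meantime, here's a minimal sketch of the general shape of a custom recipe that filters a stream before annotation. The recipe name, filter condition, and file layout are all made up, and the real llm/openai fetch recipes do more work around prompting and parsing:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "textcat.filtered-fetch",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the source .jsonl file", "positional", None, str),
)
def textcat_filtered_fetch(dataset: str, source: str):
    stream = JSONL(source)

    def filter_stream(examples):
        for eg in examples:
            # Hypothetical condition: only keep tasks flagged for review.
            if eg.get("meta", {}).get("needs_review", False):
                yield eg

    return {
        "dataset": dataset,
        "stream": filter_stream(stream),
        "view_id": "classification",
    }
```

You'd run something like this with `prodigy textcat.filtered-fetch my_dataset examples.jsonl -F recipe.py`.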