LLM and bulk annotation

shainaraza · May 29, 2023, 11:44am

If I use an LLM to annotate around 1,000 samples for span categorization (see "How can language models augment the annotation process?" on ljvmiranda921.github.io), can I then use those 1,000 samples to annotate 100,000 samples for a span categorization task? Which recipe should I use?

magdaaniol · May 31, 2023, 12:03pm

Hi @shainaraza ,

and welcome to the forum!
I assume you have a span dataset with 1000 examples and you're now looking for a way to bootstrap your annotation further with these examples.
One way to do that would be to train a small model for predicting spans, e.g. spaCy's SpanCategorizer, and then use it in Prodigy's spans.correct recipe to streamline the annotation of the big dataset.
You can find documentation on how to train a SpanCategorizer with Prodigy or with spaCy directly here.

Another way would be to add some representative examples to your prompt and try few-shot annotation. You would use the same recipe that you used for your initial annotation (which I believe is this), but with the --examples_path option to provide examples. Here you can find some documentation on the use of examples and an example file. (Lj's recipe you used for spans is very similar to the ner recipes documented there and has the same CLI options.)
That said, you would normally provide just a few good examples for the model to generalize from, so you definitely won't be using the entire set of 1000.
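
As a rough illustration of picking "just a few good examples", here is a minimal sketch that greedily selects a small subset of the annotated samples so that every span label appears at least once. It assumes the annotations are dicts with a "text" key and a "spans" list whose entries carry a "label", as in Prodigy's task format. The function name pick_few_shot and the sample data are hypothetical, and the output still has to be converted into whatever examples file format the recipe expects.

```python
from collections import Counter


def pick_few_shot(examples, per_label=1):
    """Greedily pick examples until every label has been seen `per_label` times."""
    all_labels = {s["label"] for eg in examples for s in eg.get("spans", [])}
    seen = Counter()
    picked = []
    for eg in examples:
        labels = {s["label"] for s in eg.get("spans", [])}
        # Keep the example only if it contributes a label we still need.
        if any(seen[label] < per_label for label in labels):
            picked.append(eg)
            seen.update(labels)
        # Stop as soon as every label is covered.
        if all(seen[label] >= per_label for label in all_labels):
            break
    return picked


# Hypothetical annotated samples (made up for illustration).
examples = [
    {"text": "Apple hired Tim Cook.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 12, "end": 20, "label": "PERSON"}]},
    {"text": "Google is based in Mountain View.",
     "spans": [{"start": 0, "end": 6, "label": "ORG"},
               {"start": 19, "end": 32, "label": "LOC"}]},
    {"text": "Satya Nadella leads Microsoft.",
     "spans": [{"start": 0, "end": 13, "label": "PERSON"},
               {"start": 20, "end": 29, "label": "ORG"}]},
]

few_shot = pick_few_shot(examples)
for eg in few_shot:
    print(eg["text"])
```

The greedy pass only guarantees label coverage, not quality, so it is still worth reading through the selected examples by hand before putting them in a prompt.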

ljvmiranda921 · June 2, 2023, 9:14am

Hi @shainaraza, just to add to @magdaaniol's reply, you might also want to curate the 1000 LLM-annotated samples before training a SpanCategorizer on them. By that, we mean passing them through spans.correct to build a gold-annotated dataset. While you're at that step, it might also be wise to check that all your span labels are properly represented in those 1000 samples.
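
One way to check label representation is to count span labels over the exported annotations. The sketch below assumes each record is a JSON object with a "spans" list (entries carrying a "label") and an optional "answer" key, as in a JSONL export of a Prodigy dataset. The helper names, the sample records, and the threshold of 2 are all made up for illustration, so pick a minimum that makes sense for your data.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Read one JSON object per line, e.g. from an exported dataset file."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def span_label_counts(examples):
    """Count span labels over accepted examples only."""
    counts = Counter()
    for eg in examples:
        # Records without an "answer" key are treated as accepted (assumption).
        if eg.get("answer", "accept") != "accept":
            continue
        counts.update(span["label"] for span in eg.get("spans", []))
    return counts


def underrepresented(counts, expected_labels, min_count):
    """Return the expected labels that occur fewer than `min_count` times."""
    return sorted(label for label in expected_labels if counts[label] < min_count)


# Hypothetical records standing in for the 1000 LLM-annotated samples.
examples = [
    {"text": "a", "answer": "accept",
     "spans": [{"label": "ORG"}, {"label": "PERSON"}]},
    {"text": "b", "answer": "accept", "spans": [{"label": "ORG"}]},
    {"text": "c", "answer": "reject", "spans": [{"label": "LOC"}]},
    {"text": "d", "spans": [{"label": "ORG"}]},
]

counts = span_label_counts(examples)
missing = underrepresented(counts, ["ORG", "PERSON", "LOC"], min_count=2)
print(counts)
print("Needs more examples:", missing)
```

In practice you would replace the in-memory list with read_jsonl(...) pointed at your own export file.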

Hope that helps!
