If I use an LLM to annotate ~1000 samples (How can language models augment the annotation process? (ljvmiranda921.github.io)) for span categorization, can I then use those 1000 samples to annotate 100,000 samples for a span categorization task? What recipe should I use?
Hi @shainaraza ,
and welcome to the forum
I assume you have a span dataset with 1000 examples and now you're looking for a way to bootstrap your annotation further with these examples.
One way to do that would be to train a small model for predicting spans, e.g. a spaCy SpanCategorizer, and then use it in Prodigy's spans.correct recipe to streamline the annotation of the big dataset.
You can find documentation on how to train SpanCategorizer with Prodigy or with spaCy directly here.
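As a rough sketch, that two-step workflow could look like this on the command line (the dataset names, output paths, and labels below are placeholders, not anything from your project):

```shell
# 1) Train a spancat pipeline on the 1000 LLM-annotated examples
#    ("llm_spans_1000" is a placeholder Prodigy dataset name)
prodigy train ./spancat_model --spancat llm_spans_1000

# 2) Stream the big dataset through spans.correct so the trained model
#    pre-highlights spans and you only correct its suggestions
prodigy spans.correct spans_corrected ./spancat_model/model-best ./big_dataset.jsonl --label LABEL_A,LABEL_B
```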
Another way would be to add some representative examples to your prompt and try few-shot annotation. You would use the same recipe that you used for your initial annotation (which I believe is this), but with the --examples_path option to provide examples. Here you can find some documentation on the use of examples and an example file. (Lj's recipe you used for spans is very similar to the ner recipes documented there and has the same CLI options.)
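For illustration, here's a minimal sketch of preparing such a few-shot examples file. Note the exact schema depends on the task version your recipe uses, so check the linked docs; the dict-of-label-to-spans shape below mirrors one of the spacy-llm few-shot formats, and the texts and labels are made up:

```python
import json

# Hand-pick a handful of representative, diverse examples -- not all 1000.
# ASSUMPTION: this label->spans schema is one of several few-shot formats;
# verify against the examples documentation for your recipe's task version.
few_shot_examples = [
    {
        "text": "Aspirin reduced fever in the treatment group.",
        "entities": {"DRUG": ["Aspirin"], "SYMPTOM": ["fever"]},
    },
    {
        "text": "Patients on metformin reported mild nausea.",
        "entities": {"DRUG": ["metformin"], "SYMPTOM": ["nausea"]},
    },
]

with open("examples.json", "w", encoding="utf8") as f:
    json.dump(few_shot_examples, f, indent=2)

# The file path is then what you pass via --examples_path examples.json
```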
That said, you'd normally want to provide just a few good examples to make sure the model generalizes from them, so you definitely won't be using the entire set of 1000.
Hi @shainaraza , just to add to @magdaaniol 's reply, you might also want to curate the 1000 LLM-annotated samples before training a SpanCategorizer on them. By that, we mean passing them on to spans.correct
to build a gold-standard dataset. While you're at that step, it might also be wise to check that all your span labels are properly represented in those 1000 samples.
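That label-coverage check is easy to script. A minimal sketch, assuming the standard Prodigy JSONL export (e.g. from `prodigy db-out`) where each task dict has a "spans" list whose entries carry a "label" key:

```python
import json
from collections import Counter

def label_counts(jsonl_path):
    """Count span labels in a Prodigy-style JSONL annotation export."""
    counts = Counter()
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            task = json.loads(line)
            # Each annotated span is a dict like {"start": ..., "end": ..., "label": ...}
            for span in task.get("spans", []):
                counts[span["label"]] += 1
    return counts

# e.g. label_counts("llm_annotations.jsonl")
```

A near-zero count for some label is a sign you should collect or annotate more examples of it before training.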
Hope that helps!