Hello, I am new to Prodigy and am using it to create a dataset and then train a span categorization model. I have extracted 6000 samples, of which I want to keep the best 2000 in my final dataset. I have loaded all 6000 samples into Prodigy and have manually annotated 400 of them using spans.manual. Going forward, I want to use spans.correct to annotate the remaining samples, excluding the ones already annotated. How can I extract the best 2000 samples for my final dataset (i.e. only include the accepted, annotated samples)? Thank you
Could you elaborate on what you mean by "2000 best ones"?
If you want to train a spancat model for spans.correct, you can leverage the train recipe that's documented here. There's also a full guide on training spancat models on our docs here:
Yes, I am planning to try spans.correct. I hope it works fine.
To elaborate, I have a dataset of 6000 patent claims extracted from BigQuery, but not all of them are complete; some contain very little text and are not very relevant. My target is to annotate 2000 of these 6000 claims. I extracted extra claims so that I could pick out the 2000 best samples for my model. I have loaded all of them into Prodigy, have annotated 400 manually, and will use spans.correct on this dataset of 6000 claims. While annotating, I reject the claims that I do not find relevant (using the reject button in the Prodigy interface) and annotate and accept the relevant ones. I want to keep 2000 accepted claims: as soon as I reach 2000 accepted samples, my dataset is done. I then want to extract that dataset. Is this possible, and if so, how do I do it?
I hope this clarifies my concern. Let me know if anything is unclear.
If there are no spans in a document, you may also want to hit "accept", but it depends a bit on what you're interested in doing with them afterwards. By accepting a document with no spans, you are still collecting training data in the sense that a model should not detect any spans there. Examples that demonstrate when spans should not be predicted are still useful for the model.
If you really want to focus on examples where a span does appear, you may choose to train a spancat model (which you can do with prodigy train) and then use it to make your annotation easier. One way of doing this is to write a Python script that filters the examples and keeps only those with detected spans. Then, you can pass this subset to spans.manual or spans.correct.
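As a rough sketch of such a filtering script: the function below keeps only the examples for which a pipeline predicts at least one span. It assumes each example is a dict with a "text" key and that the spancat component writes to the spans key "sc" (spaCy's default, but check your own config). The function name, the file names and the model path in the comments are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator


def keep_examples_with_spans(
    examples: Iterable[Dict[str, Any]],
    nlp: Callable[[str], Any],
    spans_key: str = "sc",  # assumption: default spans key of the spancat component
) -> Iterator[Dict[str, Any]]:
    """Yield only the examples where the pipeline predicts at least one span."""
    for eg in examples:
        doc = nlp(eg["text"])
        # doc.spans behaves like a dict that maps a spans key to a group of spans
        if spans_key in doc.spans and len(doc.spans[spans_key]) > 0:
            yield eg


def write_jsonl(path: str, examples: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line, the format Prodigy reads as a source."""
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


# Hypothetical usage, assuming a trained pipeline in ./output/model-best
# and the raw claims in claims.jsonl:
#
#   import spacy
#   nlp = spacy.load("./output/model-best")
#   with open("claims.jsonl", encoding="utf8") as f:
#       examples = [json.loads(line) for line in f if line.strip()]
#   write_jsonl("claims_with_spans.jsonl", keep_examples_with_spans(examples, nlp))
```

The resulting file can then be used as the source for spans.manual or spans.correct. Keep in mind that a model trained on only 400 examples may miss real spans, so a filter like this can also drop relevant claims.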
Just to check, are the spans that you're interested in rare? If not, it might be easier to not worry about creating a subset and to spend the effort of creating more annotations instead.
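On the original question of keeping only the accepted claims: every annotation Prodigy saves has an "answer" field ("accept", "reject" or "ignore"), so one option is to export the dataset with the db-out command and filter on that field. A minimal sketch, assuming the export was written to a file called annotations.jsonl (a hypothetical name, as is my_dataset):

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, List


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Load a JSONL export (one JSON object per line), skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


def first_accepted(
    examples: Iterable[Dict[str, Any]], limit: int = 2000
) -> List[Dict[str, Any]]:
    """Return up to `limit` examples whose answer is "accept", in their original order."""
    accepted = (eg for eg in examples if eg.get("answer") == "accept")
    return list(islice(accepted, limit))


# Hypothetical usage, after exporting on the command line with
#   prodigy db-out my_dataset > annotations.jsonl
#
#   final = first_accepted(read_jsonl("annotations.jsonl"), limit=2000)
#   print(len(final))
```

Rejected and ignored examples are simply left out, so you can keep annotating until the script returns 2000 examples.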
Thank you very much