Sorter Batch Size / Local Sorters?

kmader · March 14, 2018, 8:13am

When correcting the annotations on a number of large images, there is a substantial loading time. Based on the recipe in "Using Prodigy to train a new Computer Vision object detection model", it looks like the prefer_uncertain function pulls all (or at least a large number) of the images through the loading and prediction pipeline before anything is shown. I wrote a small wrapper function for the sorters called local_wrapper so that they only operate on small minibatches. Presumably there is a better way to adjust the "batch size"? It didn't appear in PRODIGY_README.html#sorters, and all of the docstrings have been stripped out. Setting "batch_size": 10 in the prodigy.json also does not seem to affect the results. I ask because the model I am running is very time-consuming, and it would be nice to run it on fewer samples, particularly during development.

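from itertools import islice

def local_wrapper(sorter_func, n=10):
    """Wrap a sorter so it only ever sees minibatches of at most n examples."""
    def _new_sorter(in_stream):
        in_stream = iter(in_stream)  # one shared iterator, even if a list is passed in
        while True:
            m_batch = list(islice(in_stream, n))  # pull at most n examples
            if not m_batch:
                return  # stream exhausted
            yield from sorter_func(m_batch)
    return _new_sorter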
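
As a quick sanity check that a wrapper like this keeps loading lazy, here is a minimal sketch. The `expensive_stream` and `toy_sorter` functions are hypothetical stand-ins (not part of Prodigy) for the image loading/prediction pipeline and for a real sorter. The wrapper is repeated in compact form so the snippet runs on its own.

```python
from itertools import islice

def local_wrapper(sorter_func, n=10):
    # Same idea as the wrapper above: sort minibatches of at most n items.
    def _new_sorter(in_stream):
        in_stream = iter(in_stream)
        while True:
            batch = list(islice(in_stream, n))
            if not batch:
                return
            yield from sorter_func(batch)
    return _new_sorter

pulled = []  # records which examples were actually loaded

def expensive_stream(total):
    # Hypothetical stand-in for loading an image and running the model on it.
    for i in range(total):
        pulled.append(i)
        yield ((i % 5) / 4, i)  # (score, example) tuples

def toy_sorter(batch):
    # Hypothetical sorter: scores closest to 0.5 (most uncertain) come first.
    return sorted(batch, key=lambda pair: abs(pair[0] - 0.5))

sorted_stream = local_wrapper(toy_sorter, n=5)(expensive_stream(1000))
first = next(sorted_stream)
print(first, len(pulled))  # only one minibatch has been loaded so far
```

With n=5, asking for the first example should only pull five items through the pipeline instead of the whole stream. The trade-off is that the ordering is only local: each minibatch is sorted independently of the others.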

ines · March 14, 2018, 1:36pm

Ah, cool to see that you're trying the image recipe!

The prefer_uncertain sorter has an initial "warm up" period during which it's conditioning the moving averages. The size of the pre-batch is defined in the first_n attribute (currently 64 and not exposed as an argument – but we can easily fix that!). In the meantime, you should be able to simply overwrite the first_n after the sorter is initialised:

sorted_stream = prefer_uncertain(stream)
sorted_stream.first_n = 10

The initial pre-batching is the only batching the sorter will do – after that it will simply yield out the examples, given they meet the threshold. I guess a pre-batch of 64 examples was slightly more optimised for working with text – for images, it definitely makes sense to adjust that.
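
To make the warm-up behaviour a bit more concrete, here is a rough sketch of a sorter with a pre-batch followed by threshold-based streaming. This is not Prodigy's actual implementation: the function name `prefer_uncertain_sketch`, the `alpha` parameter, the uncertainty formula, the threshold rule and the way the warm-up batch is emitted are all assumptions made for illustration. Only the idea of a `first_n` pre-batch that conditions a moving average comes from the description above.

```python
from itertools import islice

def prefer_uncertain_sketch(scored_stream, first_n=64, alpha=0.1):
    """Illustrative only, NOT Prodigy's code. Expects (score, example) tuples."""
    scored_stream = iter(scored_stream)

    def uncertainty(score):
        # 1.0 when score == 0.5, falling to 0.0 at score 0 or 1 (assumed formula).
        return 1.0 - 2.0 * abs(score - 0.5)

    # Warm-up: this is the only point where many examples are loaded up front.
    warmup = list(islice(scored_stream, first_n))
    if not warmup:
        return
    avg = sum(uncertainty(s) for s, _ in warmup) / len(warmup)
    # Assumption: the warm-up batch is emitted most-uncertain first.
    for score, eg in sorted(warmup, key=lambda pair: -uncertainty(pair[0])):
        yield eg

    # After the warm-up, examples are streamed one at a time.
    for score, eg in scored_stream:
        u = uncertainty(score)
        if u >= avg:  # assumed threshold: at least as uncertain as the average
            yield eg
        avg = (1 - alpha) * avg + alpha * u  # exponential moving average

loaded = []  # records which examples were pulled from the source

def scored_examples():
    data = [(0.5, "a"), (0.9, "b"), (0.1, "c"), (0.5, "d"), (0.95, "e"), (0.45, "f")]
    for item in data:
        loaded.append(item[1])
        yield item

stream = prefer_uncertain_sketch(scored_examples(), first_n=3)
first = next(stream)
loaded_before_first = len(loaded)
rest = list(stream)
print(first, loaded_before_first, rest)
```

In this sketch, the number of examples loaded before the first one is shown equals `first_n`, which is why lowering it helps when each example is expensive to load and score.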

As for the stripped docstrings: damn, this shouldn't be happening! I spent so much time writing nice docstrings for the internals, so that you can call help on them if you need or want to. I'll check our compiler settings and hopefully fix that for the next release!
