Shuffling Streamed Data for Annotation

Hello,
Is there a way to shuffle the data streamed from a specific path using the config settings of a custom recipe?

Thank you.

Prodigy lets you control the stream of incoming examples (a simple Python generator), so we don't need any specific settings for shuffling etc. You can just load and shuffle the data in your custom recipe. For example:

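from prodigy.components.loaders import JSONL
import random

def get_shuffled_stream(source):
    # Load the whole file into memory so it can be shuffled in place
    stream = list(JSONL(source))
    random.shuffle(stream)
    # Yield the examples one by one, so the result is still a generator
    for eg in stream:
        yield eg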

If your data is very large and you don't want to consume the whole file upfront, you could also implement more sophisticated shuffling here – that's up to you.
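One option for the large-file case is a fixed-size shuffle buffer, which keeps only a limited number of examples in memory at a time. The following is a minimal sketch, not a built-in Prodigy feature: the function name buffered_shuffle and its buffer_size and seed parameters are hypothetical and chosen for illustration.

```python
import random

def buffered_shuffle(stream, buffer_size=1000, seed=None):
    """Approximately shuffle a stream while holding at most buffer_size items.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    rng = random.Random(seed)
    buffer = []
    for eg in stream:
        buffer.append(eg)
        if len(buffer) >= buffer_size:
            # Swap a random item to the end and pop it, which is O(1)
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The input is exhausted: shuffle and emit whatever is left
    rng.shuffle(buffer)
    yield from buffer
```

In a custom recipe, this could wrap the loader, e.g. stream = buffered_shuffle(JSONL(source)). Note that the result is only approximately shuffled: an example cannot be emitted before the buffer has filled up to the point where it was read, so a larger buffer_size gives a better shuffle at the cost of more memory.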

Hello Ines,

Perfect! I thought there might be a way to do it in the recipe config that I had missed in the documentation.

Thanks again.