Hi! The JSONL loader will load the file line by line and send out the examples in order. There's no magic going on here, and it really just iterates over the lines.
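Conceptually it's not much more than this (a minimal sketch of the idea, not Prodigy's actual loader code – the file path is just an example):

```python
import json

def jsonl_stream(path):
    # Read the file lazily: one line per example, kept in file order
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```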
Prodigy will skip duplicates and examples that are already in the dataset – so if you're using the same dataset for all your experiments and you annotate the first 20 texts in the first run and start the server again, Prodigy will resume at text 21. That's typically the desired behaviour, since you want to start where you left off and not repeat any examples. If you want to start at the beginning again, the easiest and cleanest solution would be to use a fresh dataset.
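The skipping logic is essentially a filter over the incoming stream. Here's a very rough illustration of the idea – Prodigy computes its own input and task hashes internally, so the `text`-based key below is purely for demonstration:

```python
def skip_seen(stream, seen):
    # 'seen' holds identifiers of examples that are already in the dataset
    for eg in stream:
        key = eg["text"]  # illustrative only – Prodigy uses its own hashes
        if key not in seen:
            seen.add(key)
            yield eg
```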
If you want to make sure that Prodigy only sends out a new batch of questions when all previous questions have been answered, you can also set `"force_stream_order": true` in your `prodigy.json`. By default, if you open up the app twice on two different devices, you'd get two different batches of examples: the first one and the second one. Prodigy will then wait to receive the answers. With `"force_stream_order": true`, Prodigy will keep sending the first batch until it has received the answers and only then move on to batch 2. This can be relevant if the order of the questions matters a lot and you don't want it to be disrupted if the user refreshes the app. Just make sure you only have one user per session then – otherwise, you'll end up with duplicates.
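For example, your `prodigy.json` could simply contain the setting alongside whatever else you already have in there:

```json
{
  "force_stream_order": true
}
```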
Yeah, if you call `list`, you're evaluating the whole generator and essentially tokenizing and loading 17k texts into memory. That's one of the main reasons we chose JSONL as the default file format: it can be read in line by line, and using generators, you can process the incoming texts in smaller batches, perform potentially expensive preprocessing and respond to outside state (like an updated model).