Hi! When examples come in, Prodigy assigns them hashes that let it determine whether a question has already been asked. If you restart the annotation session, examples that are already in the dataset should be skipped automatically. From that perspective, it doesn't really make a difference whether you load the raw input from a database or a file.
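To make the idea concrete, here's a rough sketch of hash-based skipping in plain Python. This is not Prodigy's actual implementation: `task_hash` and `skip_seen` are hypothetical helpers, and hashing only the `"text"` field with SHA-1 is an assumption made for illustration.

```python
import hashlib
import json


def task_hash(example):
    """Hypothetical stand-in for the hash Prodigy assigns to an example."""
    # Assumption: only the text identifies the question being asked.
    payload = json.dumps({"text": example["text"]}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf8")).hexdigest()


def skip_seen(stream, seen_hashes):
    """Yield only examples whose hash isn't already in the dataset."""
    for example in stream:
        h = task_hash(example)
        if h in seen_hashes:
            continue  # already annotated, so don't ask again
        seen_hashes.add(h)
        yield example


# Hashes of examples that are already in the dataset
already_annotated = {task_hash({"text": "Apple is a company"})}
stream = [{"text": "Apple is a company"}, {"text": "Berlin is a city"}]
remaining = list(skip_seen(stream, already_annotated))
```

Because the hash depends on the content and not on where the example came from, the same filter works whether the stream is read from a file or a database.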
Another thing you can do to minimise potential data loss (in case your machine dies or something) is to use a lower "batch_size" in your config. This means that answers are sent back to the server sooner. If you're using ner.manual, you don't have to run any expensive predictions before you send out the examples, and your annotators are also less likely to annotate too fast. So using a low batch size should be fine.
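For example, the setting could look like this in your prodigy.json. The value 5 is only an illustration, so pick whatever suits your annotators' pace:

```json
{
  "batch_size": 5
}
```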
I don't think your current workflow is that bad, actually. JSONL has the big advantage that it can be read in line-by-line, so you don't end up with your whole corpus in memory.
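Just to illustrate the line-by-line part, here's a minimal sketch of a JSONL reader written as a generator. The function name `read_jsonl` is made up for this example, and it assumes one JSON object per line with UTF-8 encoding.

```python
import json


def read_jsonl(path):
    """Yield one parsed example per line without loading the whole file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Only the line currently being parsed is held in memory, so the size of the corpus doesn't really matter.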
Where does your data come from initially? Are you extracting it from a database or a different file format? If you can access that in Python, you could also write a custom loader that fetches batches of raw data from wherever it's stored. Streams in Prodigy are regular Python generators, so they can also respond to outside state, load from databases, paginated APIs etc. (I posted a super basic example on this thread the other day).
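A custom loader along those lines might look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for wherever your data actually lives. The `texts` table, its columns, and the `stream_from_db` function are all assumptions for illustration, so adjust them to your setup.

```python
import sqlite3


def stream_from_db(conn, batch_size=100):
    """Yield examples from a database, fetching one batch of rows at a time."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, text FROM texts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break  # nothing left to fetch
        for row_id, text in rows:
            # Keep the source ID in "meta" so answers can be traced back
            yield {"text": text, "meta": {"id": row_id}}
        last_id = rows[-1][0]


# Hypothetical stand-in for a real database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO texts (text) VALUES (?)",
    [("first",), ("second",), ("third",)],
)
examples = list(stream_from_db(conn, batch_size=2))
```

Since the generator only queries the next batch when it's asked for more examples, rows added to the table while the session is running will still be picked up, as long as the stream hasn't been exhausted yet.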