Hello! I'm trying to figure out the best way to feed Prodigy very large inputs that will require long sessions of manual labeling.
My first thought was to load it all into a dataset as unlabeled examples, but db-in doesn't seem to work for that.
Right now we are simply running ner.manual against the very large source .jsonl file, but I'm concerned that if the session is interrupted, it would be difficult to make sure we don't lose work or end up with duplicate data.
We could also split up the file before launching Prodigy, but that doesn't seem ideal to me. Can you recommend a good way to do this?
Hi! When examples come in, Prodigy will assign them hashes that allow it to determine whether a question has already been asked before. If you restart the annotation session, examples that are already in the dataset should be skipped automatically. So from that perspective, it doesn't really make a difference whether you load the raw input from a database or a file.
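To illustrate the idea (this is a simplified sketch, not the actual implementation), the effect is roughly equivalent to filtering the stream against the hashes already stored in the dataset:

```python
from prodigy import set_hashes
from prodigy.components.db import connect

def skip_already_annotated(stream, dataset):
    """Roughly what the built-in exclusion amounts to: drop tasks whose
    hash is already stored in the target dataset."""
    db = connect()
    seen = set(db.get_task_hashes(dataset))
    for eg in stream:
        eg = set_hashes(eg)  # adds "_input_hash" and "_task_hash"
        if eg["_task_hash"] not in seen:
            yield eg
```

(The real behaviour also depends on settings like "exclude_by", but the basic idea is the same.)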
Another thing you can do to minimise potential data loss (in case your machine dies or something) is to use a lower "batch_size" in your config. This means that answers are sent back to the server sooner. If you're using ner.manual, you don't have to run any expensive predictions before you send out the examples, and with a manual interface your annotators are also less likely to work through the queue faster than new batches arrive. So using a low batch size should be fine.
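For example, in your prodigy.json (the exact value here is just an illustration):

```json
{
  "batch_size": 3
}
```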
I don't think your current workflow is that bad, actually. JSONL has the big advantage that it can be read in line by line, so you don't end up with your whole corpus in memory.
Where does your data come from initially? Are you extracting it from a database or a different file format? If you can access that in Python, you could also write a custom loader that fetches batches of raw data from wherever it's stored. Streams in Prodigy are regular Python generators, so they can also respond to outside state, load from databases, paginated APIs etc. (I posted a super basic example on this thread the other day).
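Just to give you an idea of the shape of such a loader (the fetch_page function here is a placeholder for whatever returns a batch of raw texts, e.g. a database query or an API call):

```python
def custom_stream(fetch_page):
    """Yield Prodigy tasks from a paginated source, one page at a time.

    fetch_page(page) is a stand-in for your own code that returns a list
    of raw texts for a given page number, or an empty list when done.
    """
    page = 0
    while True:
        texts = fetch_page(page)
        if not texts:
            break
        for text in texts:
            yield {"text": text}
        page += 1
```

Since a stream is just a generator, you can return it from a custom recipe as the "stream" component.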
Hi Ines, thank you for the answer! I hadn't noticed the hashing behavior in the documentation, but that does put my mind at ease that we won't be duplicating training examples, even if the server goes down. I agree also that my concern about losing data would be mitigated by batch_size. For the moment we just decided to manually save after each document is complete -- that accomplishes the same basic thing, correct?
Currently, we use a Python script to run every image in a directory through OCR, saving the output to a .jsonl file, so I think translating that to a custom loader would be pretty straightforward. I will definitely keep that in mind as we develop this further.
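Roughly, I imagine the loader version would look something like this (pytesseract is just a stand-in here for whatever OCR call we actually end up using):

```python
from pathlib import Path

from PIL import Image
import pytesseract  # stand-in for our actual OCR library

def ocr_stream(image_dir):
    """Run OCR lazily and yield one Prodigy task per image, instead of
    writing everything out to a .jsonl file first."""
    for path in sorted(Path(image_dir).glob("*.png")):
        text = pytesseract.image_to_string(Image.open(path))
        yield {"text": text, "meta": {"source": path.name}}
```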
Sorry if this wasn't as prominent as it should have been! You can also customise the hashing btw and assign your own "_task_hash" and "_input_hash" properties to the incoming examples. The input hash represents the raw input (e.g. the text) and the task hash the question about the text (e.g. text plus label, to allow asking multiple different questions about the same information). In your case, those would probably be identical.
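For example, something along these lines (the key names are just an illustration, adjust them to whatever your tasks contain):

```python
from prodigy import set_hashes

def add_custom_hashes(stream):
    for eg in stream:
        # Hash on the raw text only for the input hash, and on text plus
        # label for the task hash (with ner.manual these will typically
        # describe the same question anyway)
        yield set_hashes(eg, input_keys=("text",),
                         task_keys=("text", "label"), overwrite=True)
```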
There's also an instant_submit config option that will send the answer back to the server instantly after the user hits accept/reject/ignore. This is the "safest" solution – but the main trade-off is that you lose the ability to undo and go back in the browser.
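If you want to try it, it's just another entry in your prodigy.json:

```json
{
  "instant_submit": true
}
```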