Managing long annotation sessions

Hi! When examples come in, Prodigy assigns them hashes that let it determine whether a question has been asked before. So if you restart the annotation session, examples that are already in the dataset should be skipped automatically. From that perspective, it doesn't really make a difference whether you load the raw input from a database or a file.
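(If you ever want to replicate that filtering yourself in a custom recipe, it's roughly something like the sketch below. `set_hashes` and the database methods are part of Prodigy's API, but the dataset name "my_dataset" is just a placeholder:)

```python
from prodigy import set_hashes
from prodigy.components.db import connect

def filter_seen(stream, dataset="my_dataset"):
    # Collect the task hashes already stored in the dataset
    db = connect()
    seen = set(db.get_task_hashes(dataset))
    for eg in stream:
        # Assign input and task hashes to the incoming example
        eg = set_hashes(eg)
        if eg["_task_hash"] not in seen:
            yield eg
```

Prodigy already does this kind of filtering for you out of the box, so this is only relevant if you're building something custom on top of it.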

Another thing you can do to minimise potential data loss (in case your machine dies or something) is to use a lower "batch_size" in your config. This means that answers are sent back to the server sooner. If you're using ner.manual, you don't have to run any expensive predictions before sending out the examples, and your annotators are unlikely to annotate faster than new batches can be fetched. So using a low batch size should be fine.
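For example, in your prodigy.json (the default batch_size is 10, so something like 5 would send answers back twice as often):

```json
{
  "batch_size": 5
}
```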

I don't think your current workflow is that bad, actually. JSONL has the big advantage that it can be read in line by line, so you never end up with your whole corpus in memory.
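That's essentially all a JSONL loader has to do under the hood — a lazy generator along these lines (a simplified sketch, not Prodigy's actual implementation):

```python
import json

def read_jsonl(path):
    # Read the file lazily: parse and yield one record per line,
    # instead of loading the whole corpus into memory at once
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```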

Where does your data come from initially? Are you extracting it from a database or a different file format? If you can access that in Python, you could also write a custom loader that fetches batches of raw data from wherever it's stored. Streams in Prodigy are regular Python generators, so they can also respond to outside state, load from databases, query paginated APIs and so on. (I posted a super basic example on this thread the other day.)
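Just to illustrate the idea, a loader for a paginated API could look something like this — the URL, query parameters and the "text" field are all assumptions about your data, so treat it as a sketch:

```python
import requests

def stream_from_api(url, page_size=50):
    # Fetch raw examples page by page from a (hypothetical) paginated
    # API and yield them in Prodigy's task format
    page = 0
    while True:
        res = requests.get(url, params={"page": page, "size": page_size})
        res.raise_for_status()
        rows = res.json()
        if not rows:
            break  # no more pages left
        for row in rows:
            # Prodigy tasks are just dicts with a "text" key (plus
            # whatever metadata you want to keep around)
            yield {"text": row["text"], "meta": {"page": page}}
        page += 1
```

Because the stream is a generator, Prodigy will only pull from it as annotators request new batches, so you never fetch more data than you actually need.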