Basic question about batch persistence

nsahler-squarespace · October 8, 2019, 8:26pm

Hello, I run prodigy with

python -m prodigy ner.manual test_dataset en_core_web_sm test.jsonl --label "LABEL_A, LABEL_B, CAT, DOG"

with a batch size of 4, where test.jsonl consists of:

{"text": "Apples are good and tasty"}
{"text": "Oranges are good and tasty"}
{"text": "Pineapples are good and tasty"}
{"text": "Lemons are good and tasty"}
{"text": "Bananas are good and tasty"}
{"text": "Grapes are vegetables"}
{"text": "Burgers are fruit"}
{"text": "Sandwiches are fruit"}

and whenever I refresh, per session, the batch number seems to iterate (meaning it starts with "apples..." i refresh, then the first item is "bananas...", then I refresh and it returns "No Tasks Available")
Is there a built-in way to prevent this default batch-iterate-on-refresh behaviour and instead have it return the latest batch that hasn't actually been interacted with?

If not, are there any custom recipes I can refer to to help with this sort of thing? We are going to have many sessions labeling data but all datapoints need labels, so at least being able to control the cursor of the data would be ideal.

Thanks!

ines · October 8, 2019, 10:42pm

Hi! I've explained some of this in more detail in this thread:

ner.manual skips 10 lines in text file when browser is refreshed

Yes, this is currently expected, because on each load, the app makes a request to the server and asks for the next batch (by default, the batch size is 10). The annotated tasks are sent back to the server periodically, so when a new batch is requested, Prodigy can’t yet know whether a question that was previously sent out was already annotated or not. (Annotating all sentences / examples is also a pretty specific goal that only applies to some use cases and data streams.)

If it’s important to you that all sentences are annotated, and you do want to handle cases where the annotator refreshes their browser, you ideally want to reconcile the questions/answers at the end of a session, and compare the _task_hash to find examples in your data that you don’t have an answer for in the dataset. You can either do this in a custom recipe within the stream generator, or as a separate session that you run after the previous one finished.

My post here has a little example of an "infinite stream" that checks the incoming examples against the hashes in the database to make sure everything is annotated:

Of course, you could also come up with your own custom logic for this. Streams in Prodigy are regular Python generators that yield example dicts, so they can respond to external state and let you control what to send out when.

nsahler-squarespace · October 9, 2019, 3:25pm

Oh perfect - I'm surprised I couldn't find that - apologies and thank you for the resources

Topic		Replies	Views
ner.manual skips 10 lines in text file when browser is refreshed usage , front-end , solved	8	1312	September 28, 2018
Losing tasks while reloading page. usage	2	700	October 15, 2018
Losing samples on browser refresh usage , done , database , streams	11	1128	October 21, 2020
"No tasks available" on page refresh usage , custom , solved	5	4377	December 27, 2018
Resuming annotations within a session (after closing the browser) usage , streams	6	1411	October 24, 2019

Basic question about batch persistence

Related topics