ner.manual skips 10 lines in text file when browser is refreshed

sked · September 26, 2018, 10:54pm

prodigy ner.manual test_dataset en_core_web_lg textfile.jsonl --label label1,label2

When I run the above command, Prodigy opens the application in a web browser. Every time I refresh the browser, it skips 10 entries in the text file: repeated refreshes show the 1st entry, then the 11th, then the 21st, and so on.

Another problem: when I have annotated samples stored in the database and have to restart the service, it steps through all the samples in the JSONL file again, even those that were previously annotated.

ines · September 27, 2018, 8:47am

Yes, this is currently expected, because on each load, the app makes a request to the server and asks for the next batch (by default, the batch size is 10). The annotated tasks are sent back to the server periodically, so when a new batch is requested, Prodigy can't yet know whether a question that was previously sent out was already annotated or not. (Annotating all sentences / examples is also a pretty specific goal that only applies to some use cases and data streams.)
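To make the skipping concrete, here's a minimal simulation of the behaviour described above. It is not Prodigy's actual server code: `get_batch` and the integer stream are hypothetical stand-ins for the server and for the lines of textfile.jsonl.

```python
from itertools import islice


def get_batch(stream, batch_size=10):
    """Hypothetical stand-in for the server handing out the next batch."""
    return list(islice(stream, batch_size))


# Stand-in for the entries of textfile.jsonl (entries 1..30).
stream = iter(range(1, 31))

# Each browser refresh triggers a new request, which consumes another batch
# from the same generator, whether or not the previous batch was annotated.
first_seen = [get_batch(stream)[0] for _ in range(3)]
print(first_seen)  # [1, 11, 21]
```

Because the stream is consumed as it is read, the entries from an abandoned batch are not sent out again within the same session.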

If it's important to you that all sentences are annotated, and you want to handle cases where the annotator refreshes their browser, you ideally want to reconcile the questions and answers at the end of a session: compare the _task_hash values to find examples in your data that don't have an answer in the dataset yet. You can do this either in a custom recipe within the stream generator, or in a separate session that you run after the previous one has finished.
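As a rough sketch of that reconciliation step, assuming both the source tasks and the annotated examples are available as lists of dicts that carry a `_task_hash` key (the helper name `find_unanswered` and the sample data are made up for illustration):

```python
def find_unanswered(source_tasks, annotated_examples):
    """Return source tasks whose _task_hash has no answer in the dataset."""
    answered = {eg["_task_hash"] for eg in annotated_examples}
    return [task for task in source_tasks if task["_task_hash"] not in answered]


# Illustrative data only. In a real setup, the source tasks would be your
# hashed input examples and the annotated examples would be loaded from
# the dataset in your database.
source_tasks = [
    {"text": "first example", "_task_hash": 101},
    {"text": "second example", "_task_hash": 102},
    {"text": "third example", "_task_hash": 103},
]
annotated = [
    {"text": "first example", "_task_hash": 101, "answer": "accept"},
    {"text": "third example", "_task_hash": 103, "answer": "reject"},
]

missing = find_unanswered(source_tasks, annotated)
print([task["text"] for task in missing])  # ['second example']
```

The resulting list could then be used as the stream for a follow-up session, so only the unanswered examples are queued up again.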

Prodigy is very agnostic to what existing annotations in a dataset "mean". But you can tell it to explicitly ignore identical questions that are already present in one or more datasets by using the --exclude option – for example, --exclude dataset_one,dataset_two.

sked · September 28, 2018, 12:19am

Thank you for your answers!

The --exclude option works in one of my environments, but it does not work in another one, which is a Kubernetes deployment using Docker. Do you have any idea why this might be? It starts from the first line of the text file even though that sample was already annotated. I have confirmed that the repeated samples have the same input hash and task hash.

honnibal · September 28, 2018, 1:30pm

@sked Hm. Forgive me if this is a dumb question, but have you verified the tasks are being persisted correctly in your Docker/Kubernetes setup? Like, are you sure it’s not using the SQLite driver (which would get dumped when the container stops running)?

sked · September 28, 2018, 4:50pm

I'm using a MySQL database. I can see that the annotated data is being stored in my dataset, and by looking at this data I know that the hashes are the same.

sked · September 28, 2018, 4:55pm

For my other question about skipping 10 entries, it would be handy if the batch size could be made configurable. I would want to set it to 1 so that it can handle refreshes without skipping too much data.

ines · September 28, 2018, 5:08pm

You should be able to set "batch_size": 1 in your prodigy.json and the batch sizes across the app will adjust accordingly. This will affect the number of examples requested from the stream, as well as the size of the batches sent back to the server.
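If you'd rather script that change than edit the file by hand, here's a small sketch that sets "batch_size" while keeping any other settings already in the file. The helper name `set_batch_size` is made up, and the path you pass in is an assumption: point it at wherever your prodigy.json actually lives.

```python
import json
from pathlib import Path


def set_batch_size(config_path, batch_size=1):
    """Set "batch_size" in a prodigy.json file, keeping any other settings."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["batch_size"] = batch_size
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config
```

For example, `set_batch_size("prodigy.json")` would leave the file containing `"batch_size": 1` alongside whatever else was in it.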

(Btw, note that there's currently a known issue with batch size 1 and the manual interface. This will be fixed in the upcoming release.)

sked · September 28, 2018, 5:20pm

That is very useful information! Could we have the bundle with the fix for batch size 1? I am trying to get a setup ready for multiple annotators.

ines · September 28, 2018, 5:21pm

Sure! Can you email me at ines@explosion.ai?
