When I run the above command in Prodigy, it opens the application in a web browser. Every time I refresh the browser, it skips 10 entries in the text file: on repeated refreshes I get the 1st entry, then the 11th, then the 21st, and so on.
Another problem: when I have annotated samples stored in the database and have to restart the service, Prodigy steps through all the samples in the JSONL file again, even those that were previously annotated.
Yes, this is currently expected: on each load, the app makes a request to the server and asks for the next batch (by default, the batch size is 10). Annotated tasks are sent back to the server periodically, so when a new batch is requested, Prodigy can't yet know whether a question that was previously sent out has already been annotated. (Annotating *all* sentences/examples is also a fairly specific goal that only applies to some use cases and data streams.)
If it's important to you that all sentences are annotated, and you do want to handle cases where the annotator refreshes their browser, you ideally want to reconcile the questions/answers at the end of a session and compare the `_task_hash` to find examples in your data that you don't have an answer for in the dataset. You can do this either in a custom recipe within the stream generator, or as a separate session that you run after the previous one has finished.
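The reconciliation step can be sketched roughly like this. The helper name `unanswered` is illustrative; in a real recipe the annotated examples would come from Prodigy's database (e.g. via `prodigy.components.db.connect()`), and the stream examples would already carry a `_task_hash` assigned by Prodigy's hashing:

```python
def unanswered(stream, annotated):
    """Yield stream examples whose _task_hash has no answer in the dataset."""
    answered = {eg["_task_hash"] for eg in annotated}
    for eg in stream:
        if eg["_task_hash"] not in answered:
            yield eg

# Toy data standing in for a Prodigy dataset and an input stream
annotated = [{"text": "first", "_task_hash": 1, "answer": "accept"}]
stream = [
    {"text": "first", "_task_hash": 1},
    {"text": "second", "_task_hash": 2},
]

# Only the example without a stored answer is re-queued
remaining = list(unanswered(stream, annotated))
```

In a custom recipe, you could wrap your stream generator with a filter like this so a follow-up session only asks the questions that are still missing answers.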
Prodigy is deliberately agnostic about what existing annotations in a dataset "mean". But you can tell it to explicitly skip identical questions that are already present in one or more datasets by using the `--exclude` option – for example, `--exclude dataset_one,dataset_two`.
The `--exclude` option works in one of my environments, but it does not work in another one, which is a Kubernetes deployment using Docker. Do you have any idea why that might be? It starts from the first line of the text file even though the samples were already annotated. I have confirmed that the repeated samples have the same input hash and task hash.
@sked Hm. Forgive me if this is a dumb question, but have you verified that the tasks are being persisted correctly in your Docker/Kubernetes setup? Like, are you sure it's not using the default SQLite driver (whose database file would be lost when the container stops running)?
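If that's the cause, one option is to point Prodigy at a database that lives outside the container, e.g. PostgreSQL, via `prodigy.json`. The connection details below are placeholders:

```json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}
```

Alternatively, mounting the SQLite database file on a persistent volume would also survive container restarts.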
For my other question about skipping 10 entries: it would be handy if the batch size were configurable. I'd want to set it to 1 so that refreshes don't skip as much data.
You should be able to set `"batch_size": 1` in your `prodigy.json` and the batch sizes across the app will adjust accordingly. This will affect the number of examples requested from the stream, as well as the size of the batches sent back to the server.
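So the relevant `prodigy.json` would just contain:

```json
{
  "batch_size": 1
}
```

Note that with a batch size of 1, answers are also sent back to the server one at a time, so there's a bit more request overhead per annotation.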