We are using Prodigy to annotate NER and several annotators are doing the annotations in parallel in different sessions. I reduced the batch size from 10 to 2 because when annotators would refresh their browser or close and reopen we would lose 10 samples and I wanted to minimize the loss. Is this behaviour expected? Is there a way to retrieve the lost samples?
Another problem we encountered when setting the batch size to 2 is that when annotators annotate too fast, we get a "no tasks available" error although there are tasks, and we need to refresh the browser to get more tasks.
Hi! Which version of Prodigy are you using and which workflows are you running?
The examples aren't really "lost" – the batch is just sent out, and Prodigy only knows it's not coming back when you end the annotation session. When you restart the server, everything that's not annotated in the current dataset will be queued up again. "force_stream_order": True should be the default in the latest versions, and it means that all batches are sent in the exact same order and re-sent when you refresh the browser. You can also set this yourself in the config if you're using a custom recipe. (You just couldn't have multiple people connecting to the exact same session, because then there's no way to tell who already annotated what, and you may see the same batch twice.)
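For reference, the setting goes in your prodigy.json (or in the "config" dict returned by a custom recipe). A minimal fragment might look like this – the batch_size value here is just an illustration, not a recommendation:

```
{
    "force_stream_order": true,
    "batch_size": 10
}
```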
The problem here is likely that a batch size of 2 is too small to always ensure there's more work available if you annotate very fast. Any preprocessing you do can also have an impact – for instance, if you run a model over the examples, it might take slightly longer until more questions are available. So the best solution is to experiment with different batch sizes until you find the best trade-off for your annotation speed.
Thanks for your answer! I am using Prodigy v1.10.1 with custom recipes for annotating NER and for correcting the annotations. I tried to use "force_stream_order", but it doesn't work at the moment because of the way we are loading the data (from a JSON file). I will try to fix that.
I tried to look more into the issue with setting force_stream_order to True. I thought the error was coming from the way I was loading the data, but then I tried with a local file and I still get the following error: "AssertionError: RepeatingFeed requires a database connection".
I am using a PostgresqlExtDatabase, could that be the cause of the problem?
I just had a closer look and I think what's happening is this: when force_stream_order is enabled, Prodigy uses a different type of logic to orchestrate the stream of examples. (In this scenario, it needs to be a bit more complex, because Prodigy needs to keep track of what's already been sent out to which session, what's coming back and what's already in the DB, so it can re-send the questions.) For some reason, the database object doesn't seem to get passed through correctly here, so Prodigy falls back to connecting to the default DB specified in the prodigy.json. It should be easy to fix and we'll include the fix in the next release.
In the meantime, a pretty simple (and actually quite elegant) workaround would be to just register your custom database with a string name, so you can refer to it in your prodigy.json. The following should work:
from playhouse.postgres_ext import PostgresqlExtDatabase  # peewee's Postgres extension
from prodigy.util import registry
from prodigy.components.db import Database

psql_db = PostgresqlExtDatabase(...)  # your connection settings here, etc.
db = Database(psql_db, "custom_postgres", "Custom PostgreSQL Database")
Edit: Forgot one line in my code example above (also see here).
The code can go before or inside your custom recipe, and you won't need to return "db" from the recipe anymore. Instead, you should now be able to write "db": "custom_postgres" in your prodigy.json, and it will be the default database used by the recipe and internally.
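So with the database registered under that string name, the prodigy.json fragment would look something like this (the name "custom_postgres" just has to match the string used when creating the Database object):

```
{
    "db": "custom_postgres",
    "force_stream_order": true
}
```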
I have run into a similar issue and have been trying to take inspiration from this thread; however, the provided solution is not working for me. First, I would like to understand how the force_stream_order flag functions. Since the stream of data comes from a file or stdin (in my case), once data has been pulled from the stream it can't be sent back. So I'm thinking that samples that have not yet been saved are stored temporarily in the destination dataset, in case the browser refreshes and the examples are lost, and that if the browser does refresh they are pulled back from the database instead of the stream, so the user can complete the annotation on them. Is my understanding correct?
I am curious because in my case I am printing the data with one recipe and then piping the result to another recipe that reads a stream of data from stdin and uses it for annotation. For this case the provided solution is not working: when I set the force_stream_order flag to true and pass the Postgres DB for the destination dataset as a parameter in the prodigy.json (same as proposed in the solution), I get the error "AssertionError: RepeatingFeed requires a database connection". Are you sure this solution works?
That's not entirely correct – examples are really only stored in the database when they're annotated. Here's how the stream works: when you open the Prodigy app in a browser, the app will request the next batch of examples from the stream. This is just a slice of the generator. When you open the app again in another tab, it will ask for the next batch from the stream. The same happens if you refresh the browser or close and reopen. Until the server is stopped and the session ends, Prodigy won't know whether the examples that were sent out are coming back – maybe session 1 takes longer to annotate, maybe their connection died. When you restart the server, Prodigy will know which answers it has, and which examples in the stream haven't been annotated yet, and will send out the unannotated examples again.
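The mechanics described above can be sketched with a plain Python generator – each request from the app just takes the next slice off the stream, which is why a refresh (a fresh request) skips the batch that was already sent out. This is a toy illustration, not Prodigy's actual internals:

```python
from itertools import islice

def get_stream():
    # a stream is just a generator of annotation tasks
    for i in range(10):
        yield {"text": f"example {i}"}

stream = get_stream()
batch_size = 2

# first request from the browser: the next batch is sliced off the stream
first_batch = list(islice(stream, batch_size))

# the browser is refreshed before the batch is answered: the app simply
# requests the next slice, so examples 0 and 1 are not sent out again
second_batch = list(islice(stream, batch_size))

print(first_batch[0]["text"])   # example 0
print(second_batch[0]["text"])  # example 2
```

On restart, Prodigy compares the stream against the saved answers, which is why the skipped examples come back then.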
If you set force_stream_order, Prodigy will keep a copy of the stream and re-send examples in the exact order they came in, and re-send batches until they're answered. So when you open the same session in a different tab or refresh the browser, the same batch is sent again. Also see this thread for some background: Missed examples on prodigy interface
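The idea behind force_stream_order can also be sketched in a few lines: keep a copy of the batch that was sent out and re-send it until it has been answered. Again, this is only a toy model of the behaviour, not Prodigy's RepeatingFeed implementation:

```python
from itertools import islice

class RepeatingStream:
    """Toy sketch of the force_stream_order idea: hold on to the last
    batch sent out and re-send it until its answers come back."""

    def __init__(self, stream, batch_size=2):
        self.stream = iter(stream)
        self.batch_size = batch_size
        self.pending = None  # batch that was sent out but not yet answered

    def get_batch(self):
        if self.pending is not None:
            # e.g. a browser refresh: re-send the same batch
            return self.pending
        self.pending = list(islice(self.stream, self.batch_size))
        return self.pending

    def mark_answered(self):
        # answers were received and saved, move on to the next batch
        self.pending = None

stream = RepeatingStream({"text": f"example {i}"} for i in range(6))
a = stream.get_batch()   # examples 0 and 1
b = stream.get_batch()   # refresh: the same examples 0 and 1 again
stream.mark_answered()
c = stream.get_batch()   # now examples 2 and 3
```

This also shows why the real version needs a database connection: to know which batches have actually come back, it has to check what's already been saved.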
Yes, you'd just have to register the database as shown in my example above, then it should work.
Thanks for the response, Ines. I now have a better understanding of how it works. I also experimented with restarting the server and observed that the examples do not actually get lost – they are fetched again from the stream, as you explained. However, I am not able to make it work with force_stream_order set to true. I am doing exactly what you suggested, i.e. connecting to the Postgres DB, wrapping the connection in the Database, registering the database and then passing it as a parameter in the prodigy.json. However, I get the same error: "AssertionError: RepeatingFeed requires a database connection". It's weird, since it works for you.
That's definitely strange! Where are you executing that code, and are you setting the custom database name in your prodigy.json? Maybe also double-check with PRODIGY_LOGGING=basic to see if it indeed connects to your custom database, and check if your prodigy.json maybe overwrites your recipe settings, or a local prodigy.json overrides the global config etc.