Multiple annotators annotating sentences from the same set of data, but each annotator gets different sentences

I have 2-3 annotators from my team annotating sentences from the same set of data. I would like to have a setup where each annotator will not see sentences batched out to the other annotators (i.e., a sentence should not be annotated by more than one annotator).

Currently, the dataset containing the sentences to be annotated sits in a folder on an internal server that the annotators' computers can access. However, I am not sure about the best way to set up the annotation project so that the different annotators won't receive the same sentences. I assume that if each annotator starts their own Prodigy server (i.e., prodigy.serve()) on their own computer, they will all get the same sentences, because each annotator's Prodigy server is agnostic about the other annotators' Prodigy servers?

A related question: by default, each annotator's annotations are stored in a SQLite database on their local machine. Is there a way to set up a database, say, in a folder on an internal server, so that different annotators' annotations are saved in the same database?

Thanks!

Regarding my last question about saving annotations to a database in a different location: after some digging, it looks like modifying the prodigy.json file lets me change both the location and the name of the annotation database. However, I wonder if it is possible to change the database config from the Python interpreter when running prodigy.serve? I tried the following, to no avail:

prodigy.serve('textcat.manual test_annotation_project "//server01/test_project/sentences.jsonl" --label TARGET', 
              db="sqlite",
              db_settings={"sqlite": {"name": "prodigy_test.db", "path": "//server01/test_project/"}})

The annotations still get saved to the default prodigy.db.

Hi! I think the database situation is probably what you want to address first:

Yes, that's no problem. The default is a SQLite database in the local Prodigy home directory, because that's the easiest. But you can customise the path and filename it uses, or use a MySQL or PostgreSQL database instead. See here for details and instructions: Database · Prodigy · An annotation tool for AI, Machine Learning & NLP
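For example, to point everyone at a shared SQLite file, the prodigy.json could look something like this (reusing the server path and filename from your post, which are of course up to you):

{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy_test.db",
      "path": "//server01/test_project/"
    }
  }
}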

Once all annotators are writing their annotations to the same database, Prodigy will be able to tell what's already been annotated and use that information to decide who gets to annotate what. There are different ways to approach this, depending on your requirements:

  1. If people are not annotating at the same time, it'll just work out-of-the-box and you won't have to do anything. Prodigy will automatically exclude annotations present in the current dataset, or in datasets you define via --exclude. So if all annotators are writing to the same set or are defining each other's sets via --exclude, an example will only get sent out if it hasn't been annotated yet.
  2. If people are potentially annotating at the same time on their own instances on their machines, you can set up a recipe with a custom stream that checks if incoming examples are already annotated in the current dataset, and only sends them out if they're not. So you can connect to the database, periodically call db.get_task_hashes to get all hashes in the dataset, and then check if the hash of the incoming example is already in the dataset (see the sketch after this list).
  3. The other option would be to use named multi-user sessions and have everyone on the same instance. You can set "feed_overlap": false to annotate with no overlap and only send out each example once, to whoever is available.
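For option 2, here's a rough sketch of what such a recipe could look like. The recipe name, the batch size and the use of the binary "classification" interface are assumptions for illustration, not a drop-in replacement for textcat.manual — the point is re-fetching the shared dataset's task hashes on every batch:

import itertools

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL
from prodigy.util import set_hashes

@prodigy.recipe(
    "textcat.manual-no-overlap",  # hypothetical recipe name
    dataset=("Shared dataset to save annotations to", "positional", None, str),
    source=("Path to the JSONL file with sentences", "positional", None, str),
    label=("Label to annotate", "option", "l", str),
)
def textcat_manual_no_overlap(dataset, source, label="TARGET"):
    db = connect()  # connect using the settings Prodigy would normally use

    def stream():
        examples = (set_hashes({**eg, "label": label}) for eg in JSONL(source))
        while True:
            batch = list(itertools.islice(examples, 10))
            if not batch:
                break
            # Re-fetch the task hashes of the shared dataset for every
            # batch, so examples annotated by others in the meantime
            # are filtered out as well
            done = set(db.get_task_hashes(dataset))
            for eg in batch:
                if eg["_task_hash"] not in done:
                    yield eg

    return {
        "dataset": dataset,
        "stream": stream(),
        "view_id": "classification",  # simple accept/reject on the label
    }

Assuming the recipe lives in a file recipe.py, each annotator would then start their instance with something like: prodigy textcat.manual-no-overlap test_annotation_project sentences.jsonl -F recipe.py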

Thanks for the very detailed answers, Ines. Just to follow up on changing the database directory and name: is it possible to make those changes when running prodigy.serve(), or does it have to be done in prodigy.json? I tried the code below, thinking that it would change the database location/name, but it didn't work. The reason I would ideally want to change the location and name at the time of executing prodigy.serve is that the annotators will also be working on other annotation projects led by a colleague of mine, who will probably want their annotation database saved in a different location and named differently. Thanks again!

prodigy.serve('textcat.manual test_annotation_project "//server01/test_project/sentences.jsonl" --label TARGET', 
              db="sqlite",
              db_settings={"sqlite": {"name": "prodigy_test.db", "path": "//server01/test_project/"}})

By "it didn't work", do you mean that Prodigy didn't use those config settings, or that the database connection didn't work? The way you've specified the config settings looks correct, so that part should work. (Just to be safe, also double-check that your local prodigy.json doesn't specify any conflicting "db_settings".)

Sorry for not being clearer. When I specified the db_settings argument in prodigy.serve, Prodigy ignored the config settings and instead continued to save annotations to the default prodigy.db database in the default directory. The prodigy.json file currently contains only {}. Thanks!

To follow up: I was digging around in the files in the prodigy folder (/site-packages/prodigy/).

I noticed that in the app.py file there is a function set_controller() that has the following block of code:

for setting in ["db_settings", "api_keys"]:
    if setting in config:
        config.pop(setting)

It appears that whatever database settings are provided to prodigy.serve are meant to be discarded and ignored? I tried removing "db_settings" from that list by changing the code block to the below, but db_settings is still ignored.

for setting in ["api_keys"]:
    if setting in config:
        config.pop(setting)

Ah, I don't think it's that part exactly, since this is only executed when the config is sent to the server (and it's there to prevent your database details from being sent to the web app). However, I do think I see the problem now, and it's similar: prodigy.serve updates the config after the controller is already initialised. This is fine for UI config settings, but by that time, the database will already have been created.

A possible solution in your case would be to use a custom recipe and return a Prodigy Database object as the value of "db" returned by the recipe (not the config setting but the actual recipe component, alongside "dataset", "stream" etc.). That should use the DB you provided.
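A minimal sketch of that idea, reusing the path and filename from your earlier post and assuming a plain "text" interface (the recipe name here is made up):

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "textcat.manual-customdb",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to the JSONL file", "positional", None, str),
)
def textcat_manual_customdb(dataset, source):
    # connect() takes the same values as "db" and "db_settings" in
    # prodigy.json, but here the database is created before the
    # controller is initialised
    db = connect("sqlite", {"name": "prodigy_test.db",
                            "path": "//server01/test_project/"})
    return {
        "dataset": dataset,
        "stream": JSONL(source),
        "view_id": "text",
        "db": db,  # the recipe component, not the config setting
    }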

Yeah, it looks like when controller = loaded_recipe(*cli_args, use_plac=True) is run (in the serve() function in __init__.py), the function connect_sqlite() in db.py is called; however, the database settings are not passed to this function.

I will try your suggestion and come back with any questions. But if there are any plans to fix this in the future, that would be much more convenient. Thanks!

Xiao

Ahhhh, I just realised I forgot a much much simpler solution: you could just set the environment variable PRODIGY_HOME to customise the user home directory. This is where the database and config settings are expected to be. If all users set the same one, they'll all use the same config and database.
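For example (a sketch using the server path from your earlier posts — the key thing is to set the variable before Prodigy resolves its home directory, so before the import is the safe place):

import os

# Point Prodigy's home directory at the shared location, so the
# config and database live there for everyone
os.environ["PRODIGY_HOME"] = "//server01/test_project/"

import prodigy

prodigy.serve('textcat.manual test_annotation_project '
              '"//server01/test_project/sentences.jsonl" --label TARGET')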

I managed to change the path by setting PRODIGY_HOME to my custom location. I used os.environ to do that. However, despite being able to change the location, I am still unable to change the database name. But I suppose that is something I can live with for now... If you have any other suggestions on changing the database name, that would be really appreciated!

Hi @ines, I have a follow-up question. I was looking into creating a custom recipe with a custom stream as you suggested here:

  1. If people are potentially annotating at the same time on their own instances on their machines, you can set up a recipe with a custom stream that checks if incoming examples are already annotated in the current dataset, and only sends them out if they're not. So you can connect to the database, periodically call db.get_task_hashes to get all hashes in the dataset, and then check if the hash of the incoming example is already in the dataset.

My question is this: since data is batched out in batches of 10 (obviously I could change the batch size to a smaller number, or even 1), a potential scenario is that two annotators start their respective instances on their own computers at the same time, and because both have only just started, they are given the same 10 documents to annotate? Or am I understanding it incorrectly? It looks as though the different annotators' instances would be agnostic of which documents have been batched out to the other annotators.

In theory, if both annotators start annotating at pretty much exactly the same time, then yes, it can happen that they get the same batch. If annotator 1 starts earlier, annotates the batch and sends it back, and annotator 2 then starts and requests their batch, they wouldn't receive the same examples, because their stream would check what's already in the shared dataset and not send out the batch that annotator 1 has just annotated.