Setting session name as a config or CLI option

Is it possible to set a session name via a config or CLI option? I have a team where each person runs Prodigy locally, but I'd still like to set the session so we can trace where answers come from when we combine datasets.

I know that the session can be set with the ?session=name query parameter. But it would be nice to preset the session to a static value for these individually run instances of Prodigy.

Ah, so if I understand the use case correctly, you want to attach additional meta info to the data you're annotating that's preserved when you merge datasets?

Recipes support a get_session_id function that was initially added to override the default timestamp session IDs (e.g. if you're starting Prodigy instances programmatically and end up with multiple sessions per second). So in a custom recipe, you could add a command-line argument for the annotator name and then include this in the dictionary of components your recipe returns:

"get_session_id": lambda: annotator_name

However, in that case, you might as well keep the automatic timestamp session ID and add your metadata to each example in the stream before you send it out for annotation. Any custom properties added to the annotation tasks will be passed through and saved in the database.

def add_meta_to_stream(stream):
    for eg in stream:
        eg["annotator_name"] = annotator_name
        yield eg  # don't forget to yield the example back out

stream = JSONL(source)  # or whatever loader you're using
stream = add_meta_to_stream(stream)

A downside of this approach is that you need to write a custom recipe, or at least wrap an existing recipe function so you can add your custom arguments and logic (see the sketch below). And you'd need to edit it whenever you want to add more metadata (like an internal project ID etc.).
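
If you go the wrapping route, a rough sketch could look like this. I'm assuming the v1.x import path prodigy.recipes.ner.manual here, and the exact signature may differ slightly across versions:

# wrapped_recipe.py
import prodigy
from prodigy.recipes.ner import manual as ner_manual  # assumed v1.x import path

@prodigy.recipe(
    "ner.manual-meta",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model", "positional", None, str),
    source=("Path to the source data", "positional", None, str),
    label=("Comma-separated labels", "option", "l", str),
    annotator_name=("Annotator name added to each task", "option", "n", str),
)
def ner_manual_meta(dataset, spacy_model, source, label=None, annotator_name="unknown"):
    # Call the built-in recipe to get its components, then wrap its stream
    label_list = label.split(",") if label else None
    components = ner_manual(dataset, spacy_model, source, label=label_list)

    def add_meta(stream):
        for eg in stream:
            eg["annotator_name"] = annotator_name
            yield eg

    components["stream"] = add_meta(components["stream"])
    return components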

A more elegant approach I can think of is to use a custom loader script that takes command-line arguments, adds the annotator name (and any other metadata) to the stream, and then pipes that forward into Prodigy. All recipes that take an input source can also read from standard input. So you could write a custom loader script like this:

# loader.py
import sys
import json
from prodigy.components.loaders import JSONL

filename = sys.argv[1]  # rudimentary arg parsing
username = sys.argv[2]
examples = JSONL(filename)
for eg in examples:
    eg["annotator_name"] = username
    print(json.dumps(eg))  # print valid JSON, not the Python dict repr

And then call it like this – the - source value tells Prodigy to read from standard input, i.e. the data you're piping forward:

python loader.py ./data.jsonl king | prodigy ner.manual your_dataset en_core_web_sm - --label ONE,TWO

This will now stream in the data and add "annotator_name": "king" to all examples that come in. If you ever want to add more metadata, you can modify your loader to take more arguments. You could also read from environment variables or somewhere else – this really depends on what you prefer.
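
As a rough sketch, here's a variant of the loader that reads the annotator name from an environment variable instead. The variable name PRODIGY_ANNOTATOR is just made up for this example:

# loader_env.py
import json
import os
import sys
from prodigy.components.loaders import JSONL

filename = sys.argv[1]
username = os.environ.get("PRODIGY_ANNOTATOR", "unknown")  # hypothetical variable
for eg in JSONL(filename):
    eg["annotator_name"] = username
    print(json.dumps(eg))  # print valid JSON so Prodigy can read it from stdin

And then:

PRODIGY_ANNOTATOR=king python loader_env.py ./data.jsonl | prodigy ner.manual your_dataset en_core_web_sm - --label ONE,TWO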

That's really interesting! I didn't know it was possible to attach arbitrary metadata.

The idea of the custom loader fits well with the design of some Kafka stream processors here. Thanks for the detailed response!
