Saving out annotations by session ID

Hi @ines, hope this message finds you well.

My use case:

  1. We have our own machine learning model that currently accepts training data in .csv format
  2. We first extract the output of this ML model, convert it to JSONL format and feed it to the stream
  3. Then, using Prodigy, we re-annotate, export a JSONL file, convert it to .csv and feed it back to our ML model.

With a custom recipe, everything is working as intended. But now I want to use PRODIGY_ALLOWED_SESSIONS so that only permitted users can annotate.

The audience is non-technical and won't have access to the terminal. All they will see is the UI.

Most of the work takes place in the on_exit function:

import csv  # needed at the top of the recipe file

def on_exit(controller):
    # Load everything saved under the controller's session ID and keep accepted answers
    examples = controller.db.get_dataset(controller.session_id)
    examples = [x for x in examples if x["answer"] == "accept"]
    # Open the output file once, in append mode, instead of once per row
    with open("AnnotatedOn" + controller.session_id + ".csv", "a", newline="") as csvFile:
        writer = csv.writer(csvFile)
        for row in examples:
            # "spans" or "tokens" may be missing, so fall back to an empty list
            for span in row.get("spans") or []:
                # token_end is inclusive, so the span covers token_start..token_end
                words = [token["text"] for token in row.get("tokens") or []
                         if span["token_start"] <= token["id"] <= span["token_end"]]
                writer.writerow(["col1", "col2", mergeElements(words), "col4", span["label"]])

return {
    "view_id": view_id,
    "dataset": dataset,
    "stream": stream,
    "config": {
        "lang": nlp.lang,
        "labels": labels
    },
    "on_exit": on_exit
}

Note: the mergeElements function just merges the annotated terms after a clean-up. It works as expected.

I understand that if I set the environment variable like this:

import os
os.environ["PRODIGY_ALLOWED_SESSIONS"] = 'manchesterUnited'

I need to use the get_session_id function, which takes the controller as its argument.

The ultimate goal is to write the annotator's name into the CSV file name. I will add fileTime = datetime.now().strftime('%b%d_%Y_%H:%M:%S') to distinguish annotated work, since I know I will lose the timestamp when I use get_session_id().

The goal is to have a file name like:

AnnotatedBy_manchesterUnited_On_fileTime.csv

Can you point me in the right direction? Thank you!

Hi! I think the get_session_id function might not be what you want and the solution is actually a bit easier. The get_session_id callback just generates you one session ID on the fly – it was introduced to make it easy to programmatically launch multiple instances of Prodigy, without getting clashes because they're launched within the same millisecond.

If you're using named multi-user sessions, the same controller can have multiple sessions that are added and defined at runtime. The controller.session_id doesn't reflect that. Instead, the ID of the session is written to the data.

So if you want to export the data to files at the end of the annotation process, you could just load the dataset and look at the "_session_id" value of each task dict. This will contain the name of the session.
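As a rough illustration of that idea, here is a minimal sketch that groups already-loaded examples by their "_session_id" and writes one CSV per session. It only works on plain task dicts and does not call Prodigy itself. The helper names (group_by_session, span_rows, export_by_session) are made up for this example, the simple " ".join stands in for your own mergeElements, and the colons in the timestamp are swapped for dashes because colons aren't allowed in file names on some systems. Depending on your Prodigy version, the "_session_id" value may also include the dataset name as a prefix, so you might want to strip that before building the file name.

```python
import csv
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def group_by_session(examples):
    """Group accepted examples by their "_session_id" value."""
    grouped = defaultdict(list)
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        # Fall back to a placeholder if an example has no session recorded.
        grouped[eg.get("_session_id", "unknown")].append(eg)
    return dict(grouped)


def span_rows(eg):
    """Yield one (text, label) pair per annotated span."""
    tokens = eg.get("tokens") or []
    for span in eg.get("spans") or []:
        # token_end is inclusive, matching the original on_exit code.
        words = [t["text"] for t in tokens
                 if span["token_start"] <= t["id"] <= span["token_end"]]
        # Placeholder for the poster's own mergeElements clean-up.
        yield " ".join(words), span["label"]


def export_by_session(examples, out_dir="."):
    """Write one CSV file per session and return the paths written."""
    stamp = datetime.now().strftime("%b%d_%Y_%H-%M-%S")
    paths = []
    for session, egs in group_by_session(examples).items():
        path = Path(out_dir) / f"AnnotatedBy_{session}_On_{stamp}.csv"
        with path.open("w", newline="", encoding="utf8") as f:
            writer = csv.writer(f)
            for eg in egs:
                writer.writerows(span_rows(eg))
        paths.append(path)
    return paths
```

Inside on_exit, you could then call something like export_by_session(controller.db.get_dataset(dataset)), where dataset is the dataset name passed to your recipe.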

(The upcoming version of Prodigy will also have a few more helper functions and properties on the controller to get the names of all currently active sessions or all annotations by session, so you don't have to put that together yourself.)

Setting PRODIGY_ALLOWED_SESSIONS to manchesterUnited will mean that only ?session=manchesterUnited is valid, and accessing the app with any other session name will raise an error. This does not replace authentication or anything, but it can help prevent typos and wrongly attributed annotations.

It was under my nose this entire time. Thank you @ines

you guys rock if I haven't said it already!
