Saving and retrieving annotations

I’m trying to use a custom HTML recipe to compare whether two texts are equal. Reading the documentation and other support entries, I managed to write the following recipe:

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import JSON


db = connect()


def update_db(answers):
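    # called with each new batch of answers the web app sends back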
    print("Answers:", answers)
    db.add_examples(answers, ['dataset'])


def on_exit(ctrl):
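    # called once when the Prodigy server is stopped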
    answers = ctrl.db.get_dataset(ctrl.session_id)
    print("Answers:", answers)
    db.add_examples(answers, ['dataset'])


@prodigy.recipe('compare-content',
                dataset=prodigy.recipe_args['dataset'])
def compare_content(dataset):
    input_file = "example.json"
    stream = JSON(input_file)
    with open('test.html') as txt:
        html_template = txt.read()

    return {
        'dataset': dataset,
        'stream': stream,
        'update': update_db,
        'on_exit': on_exit,
        'exclude': [dataset],
        'view_id': 'html',
        'config': {'html_template': html_template}
    }

The problem is that I haven’t managed to save anything to the SQLite database. What am I missing? I start the Prodigy server with prodigy compare-content dataset -F recipe.py, and after classifying examples in the web app, I exit it with Ctrl+C.

Hi! Your recipe looks good, and you shouldn’t even need the on_exit and update callbacks. If your recipe returns a 'dataset' ID, all annotations you collect will be saved to this dataset automatically, and you won’t have to do anything else.
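For example, a trimmed-down version of your recipe, keeping your example.json and test.html, could look like this:

import prodigy
from prodigy.components.loaders import JSON


@prodigy.recipe('compare-content',
                dataset=prodigy.recipe_args['dataset'])
def compare_content(dataset):
    stream = JSON("example.json")
    with open('test.html') as txt:
        html_template = txt.read()

    return {
        'dataset': dataset,  # annotations are saved here automatically
        'stream': stream,
        'exclude': [dataset],  # skip examples already annotated in this dataset
        'view_id': 'html',
        'config': {'html_template': html_template}
    }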

What happens when you run the db-out command with your dataset name? For example:

prodigy db-out dataset | less

To get a better feeling for what’s going on under the hood, you can also always set the environment variable PRODIGY_LOGGING=basic, which will output logging info for the individual components.
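For example, to restart your server with logging enabled:

PRODIGY_LOGGING=basic prodigy compare-content dataset -F recipe.py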

My dataset, given in input_file, is a JSON file with a bunch of entries like this:
{"source": {"title": "title A", "content": "Text A"}, "output": {"title": "title B", "content": "Text B"}}
which are understood by the Mustache template and displayed correctly. I mention this because the output of PRODIGY_LOGGING=basic is strange:

15:43:52 - DB: Initialising database SQLite
15:43:52 - DB: Connecting to database SQLite
15:43:52 - RECIPE: Calling recipe 'compare-content'
15:43:52 - CONTROLLER: Initialising from recipe
15:43:52 - DB: Loading dataset 'dataset' (0 examples)
15:43:52 - DB: Creating dataset '2018-05-22_15-43-52'
15:43:52 - CONTROLLER: Getting hashes for excluded examples
15:43:52 - CONTROLLER: Excluding 0 tasks from datasets: dataset

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

15:45:21 - GET: /project
15:45:21 - GET: /get_questions
15:45:21 - CONTROLLER: Iterating over stream
15:45:21 - CONTROLLER: Returning a batch of tasks from the queue
15:45:21 - RESPONSE: /get_questions (10 examples)
15:45:26 - GET: /get_questions
15:45:26 - CONTROLLER: Returning a batch of tasks from the queue
15:45:26 - RESPONSE: /get_questions (10 examples)
15:45:27 - GET: /get_questions
15:45:27 - CONTROLLER: No more batches available from the queue
15:45:27 - RESPONSE: /get_questions (0 examples)
15:45:27 - GET: /get_questions
15:45:27 - CONTROLLER: No more batches available from the queue
15:45:27 - RESPONSE: /get_questions (0 examples)
15:45:27 - GET: /get_questions
15:45:27 - CONTROLLER: No more batches available from the queue
15:45:27 - RESPONSE: /get_questions (0 examples)
15:45:28 - GET: /get_questions
15:45:28 - CONTROLLER: No more batches available from the queue
15:45:28 - RESPONSE: /get_questions (0 examples)
^C15:45:46 - CONTROLLER: Calling recipe's on_exit() method
15:45:46 - DB: Loading dataset '2018-05-22_15-43-52' (0 examples)
Answers: []
15:45:46 - DB: Getting dataset 'dataset'
15:45:46 - DB: Added 0 examples to 1 datasets

I don’t get why a new dataset is being created with the timestamp. Also, the final line says that zero examples were saved. This is the full log after launching the server, classifying all the examples, and then shutting it down.

For each annotation session, Prodigy creates an additional session dataset named after the timestamp. This is useful if you want to go back to a particular session, or exclude a "bad" session afterwards. Towards the end of the log, that session dataset is loaded again because your custom on_exit function calls answers = ctrl.db.get_dataset(ctrl.session_id). But you should be able to safely remove both the on_exit and update callbacks, because Prodigy already takes care of saving everything to the database.
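Session datasets otherwise behave just like regular datasets, so you can inspect one with db-out, or delete a "bad" one with the drop command, using its timestamped name:

prodigy db-out "2018-05-22_15-43-52" | less
prodigy drop "2018-05-22_15-43-52"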

If I read this correctly, your stream only has around 20 examples, right? If so, I think I know what the problem might be: did you hit the "save" button (or press cmd+s) before you closed the web app? Prodigy sends the annotations back in batches as you annotate, and keeps the latest 10 in the history so you can quickly undo a decision. If your stream is very short, you might never hit the autosave limit. (Usually, though, the web app should warn you if there are unsaved annotations when it’s closed without saving manually.)
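A related workaround for very short streams like this is to lower the batch size in your recipe’s 'config', so answers are sent back to the server sooner. A sketch of the relevant setting (it defaults to 10):

'config': {'html_template': html_template, 'batch_size': 5}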

As a possible solution, I suggest the following steps:

  • Edit your recipe and take out the on_exit and update callbacks.
  • Start a new annotation session and when you're done, hit the "Save" button or press cmd+s. You should see a notification that the annotations were saved successfully.
  • Exit the server.
  • Inspect your dataset again using db-out, or in Python as shown below.
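You can also check the contents of the dataset in Python, using the same database API you’re already importing in your recipe:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("dataset")  # list of annotation dicts
print(len(examples), "annotations in dataset 'dataset'")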

Thank you very much! I feel silly now that I’ve seen the icon indicating that the annotations needed to be saved, and for not reading the clear instructions in the README file :flushed:

No worries – glad it’s all working now! :+1:

In the future, we might use a more flexible system for saving and auto-saving, so you don’t have to remember to do it manually. But the current solution was the most straightforward, because it lets the user edit the latest X annotations before they’re sent back to the server, so the Prodigy server never has to deal with reconciling annotations and overwriting the corrected ones.

In practice, we’ve noticed that almost every new annotator misses the save button. I wonder if it would help to give the save button a label like “End session”, which would save any unsaved annotations and trigger an associated callback we could use to redirect the annotator to a different page.

@soumyagk Thanks for the feedback – I really like the redirect idea! I wonder if this should be a separate action, though? For example, a user might want to save their progress in between, without ending their session. So maybe we just need a separate menu that’s more intuitive and can include those kinds of secondary actions.

Btw, for the upcoming Annotation Manager, I’ve also been thinking about various strategies for annotator onboarding, and I hope we can bring some of those features and ideas to the main app as well :blush:
