Manual annotations are saved, but not in a dataset? How to export them?

Hi!

I think I basically have two main questions:

  1. Why does prodigy show that we have 0 datasets (should be 1, I guess)? (And: Did we do something wrong or is this by design?)
  2. How can we export the annotations? prodigy db-out datasetname does not work, since there is no dataset.

Some background info: We created the annotation tasks following the tutorial by issuing:

prodigy newstsa polnewstargetsentiment -F path/newstsarecipe.py path/anno.jsonl

Also, three coders have already worked on the annotations; we currently have more than 700. I do see these annotations when viewing prodigy.db in a SQLite viewer. However, prodigy states that there are 0 datasets:

root@077935fff106:/prodigy/home# prodigy stats -ls

  ✨  Prodigy stats

Version          1.8.4
Location         /usr/local/lib/python3.6/site-packages/prodigy
Prodigy Home     /root/.prodigy
Platform         Linux-4.15.0-64-generic-x86_64-with-debian-10.1
Python Version   3.6.9
Database Name    SQLite
Database Id      sqlite
Total Datasets   0
Total Sessions   0

Here's what one of the annotation tasks looks like in the DB (note that the dataset name polnewstargetsentiment from the command above occurs here too, as part of the _session_id – so I guess the command worked?).

{"targetphrase":"sometext","text":"sometext,","html":"somehtml,","options":[{"id":"positive","text":"\ud83d\ude0a positive"},{"id":"neutral","text":"\ud83d\ude36 neutral"},{"id":"negative","text":"\ud83d\ude41 negative"},{"id":"posneg","text":"\ud83d\ude0a+\ud83d\ude41 pos. and neg."}],"_input_hash":-624773216,"_task_hash":868511941,"_session_id":"polnewstargetsentiment-timo","_view_id":"choice","accept":["neutral"],"answer":"accept"}

Thank you in advance!

Cheers,
Felix

Hi! There are two things to check here:

  1. Does your recipe in newstsarecipe.py pass the dataset name polnewstargetsentiment through and return it as the "dataset" setting? This is how Prodigy knows where to save the annotations. (When you first start the server, the dataset is created if it doesn't exist – but it's often a good idea to explicitly run prodigy dataset to add a new set, to make sure everything is set up correctly.)
  2. When you start up the server for your annotators, does it always run under the same user account / write to the same DB? By default, Prodigy creates and uses a database prodigy.db in the Prodigy home directory (.prodigy in the user home). But if you're starting the server under different user accounts, for instance, it may create a separate database for each user. In that case, you probably want to configure the database settings to make sure you're always writing to the same DB.
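For example, to pin every process to the same SQLite file, the database can be configured in your prodigy.json – a minimal sketch, where the path is just illustrative:

```json
{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/prodigy/home"
    }
  }
}
```

With an explicit path like this, it no longer matters which user account starts the server or runs the CLI commands.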

Hi Ines!

Regarding your first question: I guess so, see:

@prodigy.recipe('newstsa',
                dataset=prodigy.recipe_args['dataset'],
                file_path=("Path to texts", "positional", None, str))
def sentiment(dataset, file_path):
    """Annotate the sentiment of texts using different mood options."""
    stream = JSONL(file_path)  # load in the JSONL file
    stream = add_options(stream)  # add options to each task

    return {
        'dataset': dataset,  # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        "config": {
            "choice_auto_accept": True,  # auto-accept the example once the user selects an option
            "instructions": "/prodigy/manual.html"
        },
        'on_exit': on_exit,
        'stream': stream,
    }
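For completeness, the add_options helper used above just attaches the choice options to each incoming task – a minimal sketch, with the option labels matching what's stored in the DB:

```python
def add_options(stream):
    """Attach the sentiment choice options to each incoming task."""
    options = [
        {"id": "positive", "text": "😊 positive"},
        {"id": "neutral", "text": "😶 neutral"},
        {"id": "negative", "text": "🙁 negative"},
        {"id": "posneg", "text": "😊+🙁 pos. and neg."},
    ]
    for task in stream:
        task["options"] = options
        yield task
```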

Regarding the second question: I think so, too. On the server, there is only one user (root), and the Prodigy home dir is set via an environment variable.

Thanks for the update – the recipe definitely looks correct :+1:

When you're running prodigy stats and prodigy db-out, are you setting that environment variable, too? And is the location shown in the stats (/root/.prodigy) correct?

Since you can see the tables and data in the SQLite browser, I think the most likely explanation is that the database you're loading here is not the same one that the annotations were saved to. Under the hood, the database commands only really do something like this:

from prodigy.components.db import connect
import srsly

db = connect()
examples = db.get_dataset("dataset_name")
srsly.write-jsonl("/path/to/data.jsonl", examples)

Alright, that's what I got wrong: when running Prodigy in "server mode" so that people can use it to annotate our data, I set the env variables, but when issuing prodigy stats and the like, I wasn't. Hence the difference in the output: prodigy stats shows 0 datasets without the environment variable, but as expected shows 1 dataset when it's set properly. I'm sorry for the confusion – I totally missed that the stats output already showed a different home path. Thank you for the help!
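In other words, the fix is to set the variable for every Prodigy command, not just the server – assuming PRODIGY_HOME is the variable in question, something like:

```shell
# Point every command at the same Prodigy home, not just the server
export PRODIGY_HOME=/prodigy/home

prodigy stats -ls                                     # now reports the dataset
prodigy db-out polnewstargetsentiment > annos.jsonl   # export works too
```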


No worries, glad it's working now! :+1:

That doesn't seem like a valid function name.

This works:

with open(filename, "w") as fh:
    for ex in examples:
        fh.write(srsly.json_dumps(ex) + "\n")

Ah, sorry, that was of course supposed to be write_jsonl! Also see here: GitHub - explosion/srsly: 🦉 Modern high-performance serialization utilities for Python (JSON, MessagePack, Pickle)