We want to use Prodigy purely for annotation tasks in the beginning, before we decide which models to use, etc.
I have created my own custom recipe that loads a CSV of reviews to be annotated as hate speech or OK, and uploads the annotations to a central DB. I have this working and the setup was simple!
But I have now come to the conclusion that loading from a CSV may not be ideal. Since we are getting more and more reviews every day that we want to annotate, I was thinking of having a standalone script upload the new examples into the DB directly and skipping the CSV step.
So my question is: is it possible to append examples to a dataset in the DB and have Prodigy "hot-reload" them without a restart? Right now I have to restart Prodigy, since I'm using a CSV file to load the examples.
Another question: in the dataset table in the DB, the names are set to the session ID, e.g. 2019-10-31_07-07-18. Is it possible to append some other identifier to that, or change it to something else?
Hi! Streams in Prodigy are regular Python generators, so you can set them up however you like and also make them respond to outside state, read from an external source (database, REST API), etc. For instance, here's a pseudocode example of loading data from something like a paginated API:
def custom_stream():
    # get_new_examples is a placeholder for your own function that fetches one page of examples
    page = 0
    while True:
        examples = get_new_examples(page)
        yield from examples
        page += 1
You could also use the files in a directory and, after each iteration (once all examples in the file have been sent out), check if there's a new file you can read from. I don't know where your original review data lives, but if you can retrieve it in Python, you could also do it directly in the recipe script, so you can skip the whole export step altogether.
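To make the directory idea a bit more concrete, here is a minimal sketch of a generator that keeps checking a folder for new files. It assumes the standalone script drops complete JSONL files (one JSON object per line) into that folder. The names directory_stream, poll_seconds and max_idle_polls are made up for this illustration and are not part of Prodigy.

```python
import json
import time
from pathlib import Path


def directory_stream(directory, poll_seconds=5.0, max_idle_polls=None):
    """Yield examples from every *.jsonl file in `directory`, then keep
    polling the directory for files that have not been read yet."""
    seen = set()      # paths that were already sent out
    idle_polls = 0    # consecutive polls that found nothing new
    while True:
        new_files = sorted(
            path for path in Path(directory).glob("*.jsonl") if path not in seen
        )
        if not new_files:
            idle_polls += 1
            # Optional stop condition (hypothetical): give up after N empty polls.
            if max_idle_polls is not None and idle_polls >= max_idle_polls:
                return
            time.sleep(poll_seconds)
            continue
        idle_polls = 0
        for path in new_files:
            seen.add(path)
            with path.open(encoding="utf8") as f:
                for line in f:
                    if line.strip():  # skip blank lines
                        yield json.loads(line)
```

Two caveats with this sketch: the sleep blocks whoever is asking the generator for the next example, and a file that is still being written could be read half-finished, so the uploading script should probably write to a temporary name and rename the file once it is complete.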
If you don't want to edit the recipe script (e.g. if you're using a built-in recipe), you can also write a custom loader script that writes to stdout and then pipe that forward. See here for an example.
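As a rough idea of what such a loader script could look like, here is a sketch that turns CSV rows into one JSON task per line on stdout. It assumes the reviews sit in a CSV column called "text"; the function name csv_to_jsonl and the text_column argument are placeholders for this example. How exactly the output gets piped into a recipe depends on the recipe and your Prodigy version, so check the README for that part.

```python
import csv
import json
import sys


def csv_to_jsonl(csv_file, out=None, text_column="text"):
    """Write one JSON task per non-empty CSV row to `out` (stdout by default)
    and return the number of tasks written."""
    out = sys.stdout if out is None else out
    count = 0
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip rows without any review text
        out.write(json.dumps({"text": text}) + "\n")
        count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file names): python load_reviews.py reviews.csv
    with open(sys.argv[1], newline="", encoding="utf8") as f:
        csv_to_jsonl(f)
```

The same function could just as well read from a database query instead of a CSV, as long as it prints one JSON object with a "text" key per line.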
Prodigy will typically create two datasets: one with the name you've given when you run the recipe and one timestamped dataset per session. In a custom recipe, you can also return a get_session_id callback to customise how the session IDs are generated.
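Here is a small sketch of how a get_session_id callback might be wired up. It assumes the callback takes no arguments and returns a string, and that the recipe returns it under the "get_session_id" key next to the usual components. The helper names, the "reviews" prefix and the use of UTC are choices made for this example only. In a real recipe, the function returning the dictionary would carry the @prodigy.recipe decorator, which is left out here so the snippet runs without Prodigy installed.

```python
from datetime import datetime, timezone


def make_session_id_factory(prefix):
    """Return a callback that produces IDs like 'reviews-2019-10-31_07-07-18'."""
    def get_session_id():
        # Same timestamp layout as the default names, with a custom prefix in front.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
        return f"{prefix}-{timestamp}"
    return get_session_id


def review_recipe_components(dataset, stream):
    """Build the dictionary of components a custom recipe would return."""
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",
        "get_session_id": make_session_id_factory("reviews"),
    }
```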
You might also want to check out the named multi-user sessions (see the "Multi-user sessions" section in your PRODIGY_README.html for details). This allows you to append something like ?session=johannes to the URL in the web app and associate all annotations you collect with that session. You can also customise whether all sessions should see the same examples or whether everyone should see different questions.
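If you have several annotators, you could generate their links with a few lines of Python. This sketch assumes the web app is reachable at http://localhost:8080, so adjust base_url to wherever your instance is served. The function name session_url is made up for this example.

```python
from urllib.parse import urlencode


def session_url(annotator, base_url="http://localhost:8080"):
    """Build the app URL with a named session appended as a query parameter."""
    return f"{base_url}/?{urlencode({'session': annotator})}"


for name in ["johannes", "anna"]:
    print(session_url(name))
```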