Feature request: timestamps for data entry

It would be great to have timestamps for each data entry. This would allow me to analyse how long it actually takes to annotate the dataset. A simple timestamp column in the example table could be enough.

In fact, I'm considering just hacking that in myself and adding this column via

ALTER TABLE example
ADD COLUMN created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP;

Do you think that would work?

1 Like

Hi! This is a good point and I've actually been thinking about this as well. We should probably add this to the JSON, maybe using a key like "_ts". It comes down to the event we want to log the timestamp for – from what I can see, there are three options:

  • add the timestamp when the example is created/sent out: this is easy to add (even by the user in a custom stream – see the sketch after this list), although it's likely the least interesting option because it only tells you when the batch was processed by the back-end. It also means that examples in a batch will likely end up with almost identical timestamps.
  • add the timestamp when the user locks in an answer in the UI: this is probably closest to what people would expect, because it reflects when the user clicked a button. That said, a user may answer a question, step away from the computer for two hours and then submit, so the example's timestamp would be two hours earlier than when the example ended up in the DB.
  • add the timestamp when the annotations are sent back or saved to the DB: this is closer to the approach you were thinking of, and it reflects when you actually received the example. Since examples are sent back in batches, it does mean that all examples in a batch will end up with almost identical timestamps. It doesn't technically reflect the annotation time, though, since a user may take a break and an example is only sent back once a batch of answers is full.
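
For the first option, a custom stream can attach the timestamp itself. A minimal sketch, assuming a JSONL source and the "_ts" key from above (the file name is just a placeholder):

import datetime
import srsly

def stream_with_timestamps(source):
    # Attach a creation timestamp to each example as it's sent out
    for eg in srsly.read_jsonl(source):
        eg["_ts"] = datetime.datetime.now().isoformat()
        yield eg

stream = stream_with_timestamps("./my_data.jsonl")  # placeholder file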

We could also consider adding all three of those timestamps to the data – but I want to be careful here and make sure we don't add too much bloat to each individual record.

It might! But I'm not 100% sure whether this could cause unintended side effects, so I'd be very careful about modifying tables and columns. But if you have a backup, it's definitely something you can try!

Thanks for your quick reply, Ines. Option 2 is what I'm looking for. Right now I have the instant-save option set to True, so options 2 and 3 are identical in my use case.

I've solved this now by implementing the following in my recipe:

import datetime

def before_save(examples):
    created_at = datetime.datetime.now().replace(microsecond=0).isoformat()
    for eg in examples:
        eg["created_at"] = created_at
    return examples

# And in my actual recipe:
return {
    # ...
    "before_db": before_save,
    # ...
}
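
In case it's useful to anyone else, here's roughly how the callback slots into the components dict returned by a custom recipe – the recipe name and view_id are just placeholders for my actual setup:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("my-timestamped-recipe")  # placeholder name
def my_timestamped_recipe(dataset, source):
    return {
        "dataset": dataset,           # dataset the answers are saved to
        "stream": JSONL(source),      # stream of input examples
        "view_id": "classification",  # placeholder interface
        "before_db": before_save,     # the callback defined above
    }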

Hi Ines! I'd like to offer my perspective, which is informed by some experience in multilingualism research. Timestamps and logging would be an awesome feature!

Beyond the events you mentioned, it can often be really helpful to store the whole history of what a user did and when. That info can really help to e.g. assess whether a task was well-formulated. For example, when a test subject repeatedly changes their answer, you might want to check whether the instructions were unclear or the tag set was ambiguous. Sometimes it's the task itself that poses the challenge, but this kind of info can really help with quality-controlling the task formulation, leading to much better data collection.

I realise this might be a bigger feature to implement, but I am sure it would be much appreciated. I would definitely love it!

This is a good point and an interesting suggestion 💯 The only aspect that could be tricky is that the task data can easily get very verbose if it's tracking the entire task on every update.

Mostly thinking out loud, but you might be able to implement something like this by listening to the prodigyupdate event and adding a task property "history" with timestamped versions of the given task:

document.addEventListener('prodigyupdate', event => {
    const { task } = event.detail
    const history = task.history || {}
    // Add entry with timestamp and selected properties you want to track
    history[Date.now()] = { spans: task.spans, answer: task.answer }
    window.prodigy.update({ ...task, history })
})

This way, you can decide which values you care about (e.g. "spans" and "answer", or "label" etc.). And later on, you can even run some automated diagnostics, e.g. if the goal is named entities and the history is significantly longer than the final number of "spans", it can indicate that the annotator went back and forth a lot. And then you can look at the result in more detail.
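
As a rough sketch of such a diagnostic, assuming the "history" property added by the snippet above and a placeholder dataset name:

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
for eg in db.get_dataset("ner_dataset"):  # placeholder dataset name
    history = eg.get("history", {})
    spans = eg.get("spans", [])
    # A history much longer than the final spans suggests a lot of
    # back and forth – the threshold here is arbitrary
    if len(history) > 2 * len(spans):
        print(eg.get("_task_hash"), len(history), len(spans))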

2 Likes

Ines, thank you so much. This is a very helpful snippet for learning how to get and update the task from the frontend. I think these things are a tad hard to learn from the docs alone. It would be terrific if there was a sample recipe that showcased how to do this.

I think having configurable logging and introspection would be a fantastic feature for Prodigy down the road. One could even automatically analyse annotators' behaviours and highlight when a task formulation looks like it could be improved. And if one could highlight where time is spent, that also opens up many possibilities for improving processes and making them faster.

2 Likes

@ines How would you implement this? Is there a callback that allows us to do this?

I've tried validate_answer with no success. I could use before_db, but I'd prefer to register the timestamp at the moment the user accepts the answer.

If you want to log the immediate action in the UI, you can do that with custom JavaScript: Custom Interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP. prodigyupdate is fired on every update to the annotation task, and prodigyanswer when the answer is selected. Also see the example I posted above that logs the entire history of a given example in the UI and adds it as JSON.

Yeah, validate_answer is only intended to check the example – it won't let you modify the example in place. That's intentional, because otherwise a subtle bug in a function here could potentially destroy data, and it'd mean that the validation endpoint would have to send data back and replace the example in the UI, which is very unintuitive.

(That said, you could, in theory, use the validate_answer function to store additional event data separately. For example, have a separate database where you log the hash of the example and the timestamp.)
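
A minimal sketch of that idea, using a plain SQLite table on the side and the _task_hash Prodigy assigns to each example:

import datetime
import sqlite3

# Separate side database – this never touches Prodigy's own DB
log_db = sqlite3.connect("answer_log.db", check_same_thread=False)
log_db.execute(
    "CREATE TABLE IF NOT EXISTS answer_log (task_hash INTEGER, answered_at TEXT)"
)

def validate_answer(eg):
    # Record when the answer arrived, keyed by the task hash –
    # the example itself is left unchanged
    log_db.execute(
        "INSERT INTO answer_log VALUES (?, ?)",
        (eg.get("_task_hash"), datetime.datetime.now().isoformat()),
    )
    log_db.commit()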

1 Like

Just released Prodigy v1.11, which now automatically adds a _timestamp key to all examples! We've also added a new progress command that calculates the annotation progress over time for different intervals, using the _timestamp if available (and the dataset creation time as a fallback): https://prodi.gy/docs/recipes#progress
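
If you want to check the new key on your own annotations, something like this should work (the dataset name is a placeholder):

from prodigy.components.db import connect

db = connect()
for eg in db.get_dataset("my_dataset"):  # placeholder dataset name
    print(eg.get("_timestamp"))  # added automatically by v1.11+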

2 Likes