Making sure "update" is called before iterating the "stream"

We (@Jpetiot and I) are currently working on a pyannote recipe to perform interactive speaker diarization.

For reasons that I can explain further if needed, we need the model in the loop (basically, a numpy array containing speaker embeddings) to be updated after every human annotation, inside the update callback. To do so, we went with batch_size = 1 and instant_submit = True.
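In case it helps, here is roughly what our recipe skeleton looks like (the recipe name is just for illustration, and chunk_audio stands in for our actual chunking logic):

```python
import prodigy

@prodigy.recipe("audio.diarize.interactive")  # illustrative name
def diarize(dataset: str, source: str):
    speakers = {}                    # label -> speaker embedding (numpy array)
    stream = chunk_audio(source)     # placeholder: yields ~10s audio tasks

    def update(answers):
        ...                          # refresh `speakers` from the answers

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "audio_manual",
        "update": update,
        "config": {
            "batch_size": 1,         # only one task in flight at a time
            "instant_submit": True,  # submit each answer as soon as it is made
        },
    }
```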

However, it appears that Prodigy calls next(stream) before calling update (or maybe the latter runs asynchronously, since this behavior is apparently not deterministic):

```python
task = next(stream)
answer = manual_annotation(task)
task = next(stream)  # <-- sometimes called before `update`
update([answer])
```

Is there a way to enforce the following sequence of actions?

```python
task = next(stream)
answer = manual_annotation(task)
update([answer])  # <-- before calling subsequent `next(stream)`
task = next(stream)
answer = manual_annotation(task)
update([answer])
...
```

Thanks!

Hervé.

Yeah, this is all async. At the moment, there's no guarantee that update is called before the next batch is requested, because those two calls are separate and we're not blocking on update. So if executing the update callback takes a while, Prodigy might send out the next batch first.

Out of curiosity, why is it a must that the model is updated immediately after you label data? I might be able to suggest a "hack" if I understand your situation better.

Thanks for your answer... though this is exactly what I was afraid of :slight_smile:

A bit of context. Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. This is not a speaker recognition task as we do not care about the actual identity of the speaker -- we just want to assign the same label (e.g. SPEAKER_A, SPEAKER_B, ...) to every speech turn of the same speaker. In machine learning jargon, it can be seen as a clustering task rather than a classification task.

Our current solution. Because audio recordings can be very long, it does not make much sense to show the whole recording at once to the annotators -- this would make each task unbearably difficult. Instead, we model the problem as an online clustering task where the audio is presented to the annotator in short (10s or so) chunks, in chronological order. For the first chunk, the annotator does the job on their own. But starting as early as the second chunk, each chunk is pre-annotated by assigning speech turns to the most similar speaker from the previous chunks, based on speaker embeddings computed in update.
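The pre-annotation step itself boils down to a nearest-neighbour lookup over the embeddings collected so far -- a minimal numpy sketch:

```python
import numpy as np

def most_similar_speaker(embedding, speakers):
    """Return the label of the known speaker closest to `embedding`.

    `speakers` maps labels ("SPEAKER_A", ...) to embedding vectors
    accumulated from previously annotated chunks.
    """
    labels = list(speakers)
    matrix = np.stack([speakers[label] for label in labels])
    # Cosine similarity between the new embedding and each known speaker
    scores = matrix @ embedding
    scores /= np.linalg.norm(matrix, axis=1) * np.linalg.norm(embedding)
    return labels[int(np.argmax(scores))]
```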

Why does update need to be called synchronously? The annotator also has the option to decide that the current chunk contains a speaker that has never been heard before. This is where having update behave as described in the original issue is critical: the new speaker must exist in the internal set of speaker embeddings to be recognized in the subsequent chunk. We cannot wait for a few chunks to pass before recognizing this new speaker, or they would be incorrectly split into two new speakers.
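Concretely, our update callback looks something like this (embed stands in for the actual embedding computation, and speakers is the dictionary shared with the stream):

```python
counts = {}  # label -> number of embeddings averaged so far

def update(answers):
    # With batch_size = 1 and instant_submit = True, `answers` holds one task.
    for answer in answers:
        for span in answer.get("audio_spans", []):
            emb = embed(answer["audio"], span)  # placeholder embedding function
            label = span["label"]
            if label not in speakers:
                # A brand-new speaker must be registered *before* the next
                # chunk is pre-annotated -- hence the ordering requirement.
                speakers[label] = emb
                counts[label] = 1
            else:
                # Known speaker: keep a running mean of their embeddings.
                n = counts[label]
                speakers[label] = (speakers[label] * n + emb) / (n + 1)
                counts[label] = n + 1
```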

A "hack" we thought about was to actually use validate_answer instead of update for that purpose but this does not feel right (even though it would certainly solve our problem) and an option to make update blocking would be a great addition.

Given your situation, I think exploring validate_answer certainly makes sense. But I agree that it's a bit hacky.

Part of me is also wondering if you can perhaps split up the labelling task into two batches. The first batch might include the segments of the first 10 seconds of each audio clip. When you receive enough of these you can use them to pre-fill segments that follow, which creates the second batch.

I'm mentioning this because I have found that a "semi-active"/"offline"/"batch" approach to labelling can sometimes be very pragmatic. I can imagine that, with a 20-minute audio segment, after labelling the first 10 seconds you might just want to run a batch job to pre-fill the rest of the segment. It would be some manual labour to run the batch job in between, but given a long enough sequence of audio, it might still be fine.
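As a rough sketch of what I mean (all three helpers here are placeholders for your own task builders):

```python
def two_phase_stream(files):
    # Phase 1: the opening 10 seconds of every recording, labelled from scratch.
    for file in files:
        yield make_task(file, start=0.0, end=10.0)
    # Phase 2: once enough phase-1 answers are in, pre-fill the remaining
    # chunks of each recording from the embeddings learned so far.
    for file in files:
        for start in remaining_offsets(file, step=10.0):
            yield prefill_task(file, start, start + 10.0)
```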

Thanks.

> Part of me is also wondering if you can perhaps split up the labelling task into two batches. The first batch might include the segments of the first 10 seconds of each audio clip. When you receive enough of these you can use them to pre-fill segments that follow, which creates the second batch.

This would also have the side effect that annotators would somehow lose context.
Identifying speakers by their voice is already quite complex; you don't want to expose them to multiple conversations at once.

Also, this won't work in our use case for two main reasons:

  1. a speaker may appear later in the conversation (i.e. in the second batch)
  2. we have other constraints that only allow one conversation at a time

But I'll keep this idea in mind for later recipes -- it may well prove useful in other situations.

So: validate_answer it is, then...

... until update can be made synchronous through a configuration flag?
Is this something the Prodigy team would even consider?