How does exclusion of already-seen tasks work?

Hi all! I would like to understand when and how exactly Prodigy excludes tasks it has already seen. Is it that the server fetches tasks from its stream and then filters them before sending them out? If so, does this cause batch sizes to shrink? I have observed batches fall short of the configured value and such deduplication would explain some of my observations. The reason I'm asking is I was struggling to get my progress display right.

Also, how can I disable this feature? I have tried setting auto_exclude_current to false, but in my tests, tasks seem to still be filtered based on their hashes.

By default, Prodigy uses the hashes to exclude examples that are already present in the current dataset, as well as duplicate examples within the stream. This happens via filter functions applied to the stream generator. Batching is the last thing that happens, right before the questions are sent to the server, so any filtering that happens before that shouldn't impact the batch size.
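To illustrate why filtering before batching leaves the batch size intact, here is a minimal sketch in plain Python. It is not Prodigy's actual code: `filter_seen` and `batches` are hypothetical helpers, and the only thing borrowed from Prodigy is the `_task_hash` key on each task.

```python
from itertools import islice

def filter_seen(stream, dataset_hashes):
    """Skip tasks already in the dataset and duplicates within the stream."""
    seen = set(dataset_hashes)
    for task in stream:
        task_hash = task["_task_hash"]
        if task_hash in seen:
            continue
        seen.add(task_hash)
        yield task

def batches(stream, size):
    """Group an iterable into lists of `size` items; only the last may be shorter."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch

# Hypothetical stream with duplicates (2 and 1 repeat) and one task (3)
# that is assumed to be in the dataset already.
tasks = [{"_task_hash": h} for h in [1, 2, 2, 3, 4, 5, 1, 6, 7]]
result = list(batches(filter_seen(tasks, {3}), 2))
print([[t["_task_hash"] for t in batch] for batch in result])
```

Because the batcher pulls from the already-filtered generator, it keeps consuming the stream until it has a full batch. In this sketch a batch only comes up short when the stream itself runs out.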

Setting auto_exclude_current to false should prevent Prodigy from excluding examples if their hashes are already in the dataset. Where the exclusion logic gets slightly more complex is if you have multiple named sessions making requests to the same instance: here you may have the app requesting a new batch without hashes X, Y and Z, because it already has those on the queue. (But from what you describe, it sounds like this might not be that relevant in your case.)
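The multi-session case can be pictured roughly like this. Again, this is only a sketch under assumptions: `next_batch` and its `exclude` argument are made up for illustration and are not Prodigy's API.

```python
from itertools import islice

def next_batch(stream, size, exclude):
    """Return up to `size` tasks whose hash is not in `exclude`."""
    fresh = (task for task in stream if task["_task_hash"] not in exclude)
    return list(islice(fresh, size))

# Hypothetical shared stream; the requesting session already has 2 and 3 queued.
stream = iter({"_task_hash": h} for h in range(1, 8))
queued = {2, 3}
batch = next_batch(stream, 3, queued)
print([t["_task_hash"] for t in batch])
```

Here too the excluded hashes are skipped while the batch is being filled, so the session still receives a full batch as long as the stream has enough remaining tasks.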