How to get the count of unlabeled / unannotated data?

I was wondering how to get a count of unlabelled examples via the Python API.

I know I can do this to get the total labelled annotations:

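from prodigy.components.db import connect

# Connect using the database settings Prodigy is configured with
db = connect()
all_dataset_names = db.datasets  # names of all datasets in the database
some_dataset = all_dataset_names[0]
# Number of annotated examples stored in that dataset
annotations_count = db.count_dataset(some_dataset)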

But I want to be able to display and show progress in terms of the total number of annotations that are still unlabelled. Is that possible?

Hi @strickvl ,

Have you seen the total_examples_target config setting? If you know upfront how many examples you have to annotate, this expresses progress as the quotient of the number of examples already annotated and the configured target. In a multi-session scenario, it uses the total annotated by the given session.
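As a rough illustration of where that setting can live, here is a minimal sketch of the components dictionary a custom recipe might return. The helper name build_components, the dataset name, and the view_id are placeholders, and in a real recipe this dictionary would be returned from a function registered with Prodigy; the same key can also be set in prodigy.json.

```python
def build_components(stream, dataset, target):
    """Assemble recipe-style components with a fixed progress target.

    `build_components` is a hypothetical helper for illustration only.
    """
    return {
        "dataset": dataset,          # dataset the annotations are saved to
        "stream": stream,            # iterable of examples to annotate
        "view_id": "text",           # placeholder annotation interface
        "config": {"total_examples_target": target},
    }


components = build_components([{"text": "hello"}], "my_dataset", 500)
```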

If you want something slightly different, you can always define a custom progress callback and return it as the progress component of a recipe. There's an example in the same documentation section I linked above.
And finally, you can access the total annotated examples via the Controller object (probably easier than connecting to the DB):

# total annotated examples
controller.total_annotated

# total annotated for a particular session
controller.get_total_by_session(session.id)
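To make the "quotient" idea concrete, here is a small sketch of the arithmetic a custom progress function could perform. It only uses the total_annotated attribute mentioned above; the fraction_done helper and the stand-in controller object are assumptions for illustration, and the exact signature Prodigy expects for a progress callback depends on the version, so check the documentation before wiring it into a recipe.

```python
from types import SimpleNamespace


def fraction_done(total_annotated, target):
    """Return progress in [0.0, 1.0] as annotated / target."""
    if target <= 0:
        # No meaningful target, so report no progress rather than divide by zero
        return 0.0
    return min(total_annotated / target, 1.0)


# Stand-in for a real Controller, used only to demonstrate the calculation
fake_controller = SimpleNamespace(total_annotated=120)
progress = fraction_done(fake_controller.total_annotated, 480)
```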

All of those are interesting, but I think I wouldn't want to assume that there is an active session. Actually, I wouldn't want to assume that Prodigy is even running anywhere, which is why I wanted to go straight to the database.

But perhaps there's a disconnect between my expectation (from other annotators) where you usually have a bunch of examples that you want to annotate, and then you have the count of the number you've already successfully annotated. With Prodigy, every time you start a session, as I understand it, the list of examples that you can annotate might change.

For now (since I wanted this information for the ZenML integration I'm building), I've just disabled the option to show the unlabelled example/task count (which we have enabled for the other annotators with which we integrate).

So, just to be clear, I want to find that value not for any Prodigy callback etc., but just so as to integrate with ZenML (see the GitHub pull request "Prodigy Annotator Integration" by strickvl, #2655 in zenml-io/zenml).


Actually, I wouldn't want to assume that Prodigy is even running anywhere, which is why I wanted to go straight to the database.

In that case, yep, makes total sense.
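For a database-only count, one possible sketch is to hash the pool of source examples yourself and subtract the input hashes already stored in the dataset. This assumes you know the full pool of examples up front and that they hash the same way they did when they were annotated; the function names count_unlabelled and unlabelled_in_dataset are made up for this example.

```python
def count_unlabelled(source_hashes, annotated_hashes):
    """Number of distinct source hashes that are not yet in the dataset."""
    return len(set(source_hashes) - set(annotated_hashes))


def unlabelled_in_dataset(dataset, source_examples):
    """Compare a known pool of examples against a Prodigy dataset.

    Prodigy is imported lazily so the pure helper above works without it.
    """
    from prodigy import set_hashes
    from prodigy.components.db import connect

    db = connect()
    # Hash a copy of each example so the caller's data is left untouched
    source_hashes = [set_hashes(dict(eg))["_input_hash"] for eg in source_examples]
    return count_unlabelled(source_hashes, db.get_input_hashes(dataset))


# Pure-Python demonstration with made-up hash values
remaining = count_unlabelled([1, 2, 3, 4], {2, 4})
```

No running Prodigy server or active session is needed for this, only access to the database.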

Yes, it might change as a function of other annotators' behavior. The default setup is meant to make sure the annotation progresses as fast as possible given the pool of annotators available. The total of examples annotated by a session is updated from the DB state every time that session reconnects. This behavior can also be fully customized via a custom session factory and/or routers. You could potentially pre-assign tasks to annotators so that they work towards their goals independently of one another.
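To illustrate the pre-assignment idea in plain Python, the sketch below splits a fixed pool of examples across annotators in round-robin order, which gives each annotator a stable total to count down from. It does not use Prodigy's session factory or router API; the preassign helper and the annotator names are hypothetical.

```python
from itertools import cycle


def preassign(examples, annotators):
    """Distribute examples across annotators in round-robin order."""
    assignments = {name: [] for name in annotators}
    for eg, name in zip(examples, cycle(annotators)):
        assignments[name].append(eg)
    return assignments


plan = preassign(["a", "b", "c", "d", "e"], ["alice", "bob"])
# Each annotator's fixed workload, from which a per-annotator
# unlabelled count can be derived by subtracting what they have annotated
workload = {name: len(tasks) for name, tasks in plan.items()}
```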