How to get the count of unlabeled / unannotated data?

strickvl · May 1, 2024, 9:59am

Was wondering how to get a count of unlabelled annotations via the Python API?

I know I can do this to get the total labelled annotations:

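from prodigy.components.db import connect

# Connect to the database using the default Prodigy settings
db = connect()
# Names of all datasets stored in the database
all_dataset_names = db.datasets
some_dataset = all_dataset_names[0]
# Number of annotated examples saved in that dataset
annotations_count = db.count_dataset(some_dataset)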

But I want to be able to show progress in terms of the number of examples that are still unlabelled. Is that possible?

magdaaniol · May 3, 2024, 8:31am

Hi @strickvl ,

Have you seen the total_examples_target config setting? If you know upfront how many examples you have to annotate, this expresses progress as the ratio of the number of examples already annotated to the configured target. In a multi-session scenario, it uses the total annotated by the given session.

If you want something slightly different, you can always define a custom progress callback and return it as the progress component of a recipe. There's an example in the same documentation section I linked above.

And finally, you can access the total annotated examples via the Controller object (probably easier than connecting to the DB):

# total annotated examples
controller.total_annotated

# total annotated for a particular session
controller.get_total_by_session(session.id)
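
For illustration, a custom progress callback along those lines might look roughly like the sketch below. It assumes the callback receives the controller plus the return value of the update callback, as in the docs example; make_progress, the target of 200 and fake_ctrl are hypothetical names used only here, and fake_ctrl stands in for a real Controller.

```python
from types import SimpleNamespace


def make_progress(target):
    """Build a progress callback that reports annotated / target, capped at 1.0."""
    if target <= 0:
        raise ValueError("target must be positive")

    def progress(ctrl, update_return_value=None):
        # ctrl.total_annotated is the Controller attribute mentioned above
        return min(ctrl.total_annotated / target, 1.0)

    return progress


# Stand-in for a real Controller, for illustration only
fake_ctrl = SimpleNamespace(total_annotated=50)
progress = make_progress(200)
print(progress(fake_ctrl))  # 0.25
```

In a recipe, the returned function would be passed as the "progress" component; the cap at 1.0 just keeps the bar from overflowing if more examples are annotated than the target.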

strickvl · May 3, 2024, 8:47am

All of those are interesting, but I think I wouldn't want to assume that there is an active session. Actually, I wouldn't want to assume that Prodigy is even running anywhere, which is why I wanted to go straight to the database.

But perhaps there's a disconnect with my expectation (from other annotators), where you usually have a fixed set of examples that you want to annotate, and then a count of how many you've already annotated. With Prodigy, as I understand it, the list of examples that you can annotate might change every time you start a session.

For now (since I wanted this information for the ZenML integration I'm building), I've just disabled the option to show the unlabelled example/task count (which we have enabled for the other annotators with which we integrate).

strickvl · May 3, 2024, 8:49am

So, just to be clear, I want to find that value not for any Prodigy callback etc., but just to integrate with ZenML (see "Prodigy Annotator Integration", Pull Request #2655 in zenml-io/zenml on GitHub).

magdaaniol · May 6, 2024, 7:28pm

"Actually, I wouldn't want to assume that Prodigy is even running anywhere, which is why I wanted to go straight to the database."

In that case, yep, makes total sense.

Yes, the list of examples available to annotate might change depending on what the other annotators do. The default setup is meant to make sure the annotation progresses as fast as possible given the pool of annotators available. The total number of examples annotated by a session is updated from the DB state every time that session reconnects. This behavior can also be fully customized via a custom session factory and/or routers. You could potentially pre-assign tasks to annotators so that they work towards their goals independently of one another.
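
If the full pool of source examples is known outside of Prodigy, one way to get an "unlabelled" count without a running session is to compare the source examples against what is already saved in the dataset. The sketch below is only that, under stated assumptions: both lists are plain dicts carrying the "_input_hash" key that Prodigy stores on saved examples, the source examples have been hashed the same way as the annotated ones, and count_unlabelled is a hypothetical helper, not part of Prodigy. Loading the annotated examples from the database and hashing the source file are left out and would need checking against your Prodigy version.

```python
def count_unlabelled(source_examples, annotated_examples, key="_input_hash"):
    """Count distinct source examples whose key is absent from the annotated ones."""
    annotated_keys = {eg[key] for eg in annotated_examples if key in eg}
    source_keys = {eg[key] for eg in source_examples if key in eg}
    return len(source_keys - annotated_keys)


# Toy data for illustration only: three source inputs, one of them annotated twice
source = [
    {"_input_hash": 1, "text": "a"},
    {"_input_hash": 2, "text": "b"},
    {"_input_hash": 3, "text": "c"},
]
annotated = [
    {"_input_hash": 1, "answer": "accept"},
    {"_input_hash": 1, "answer": "reject"},
]

remaining = count_unlabelled(source, annotated)
print(remaining)  # 2
```

Comparing sets of hashes rather than subtracting raw counts avoids undercounting when the same input has been annotated more than once, for example by several annotators.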
