prodigy review --auto-accept exhausting stream before all annotations saved to gold dataset

Hello!

I have collected a number of annotations that I'd like to review. The stats for the annotations dataset (let's call it ds) are as follows:

Month      New   Unique (new)   Total   Unique (total)
Aug 2021   278   278            278     278
Sep 2021   899   880            1177    1158
Oct 2021   327   322            1504    1480

I want to review the dataset into ds_gold to have a set of unique gold standard annotations, and anticipate that I'd have to review at most 1504-1480=24 annotations to get there.
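
The "at most 24" estimate can be sanity-checked with a rough sketch. It assumes each annotation carries Prodigy's `_input_hash` key and that "unique" in the stats means distinct input hashes; both are assumptions about how the stats were computed, and the sample records are made up.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def review_upper_bound(
    examples: Iterable[Dict[str, Any]], key: str = "_input_hash"
) -> int:
    """Return total minus unique: the most examples that could need review."""
    counts = Counter(eg[key] for eg in examples)
    total = sum(counts.values())
    return total - len(counts)


# Made-up records: inputs 1 and 2 were annotated twice, input 3 once.
sample = [{"_input_hash": h} for h in (1, 1, 2, 2, 3)]
print(review_upper_bound(sample))  # 5 total - 3 unique = 2
```

With the numbers from the table, the same subtraction gives 1504 - 1480 = 24.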

I think the review --auto-accept flag is right for my use case, and used prodigy review ds_gold ds --auto-accept.

I'm running into a situation where the stream is exhausted and doesn't "refresh" until I restart the server. If I restart the server I get 23 annotations auto-accepted into the dataset and then a "No Tasks" message. Sometimes I get a duplicate annotation to review before restarting the server but usually not.

I think it could be because of the sparsity of annotations with conflicts. I am going to dive into the review recipe to check, but thought I'd log this issue here in case there's something obvious I'm missing. I'll update if I solve it.


Update
Here's the output with basic logging (PRODIGY_LOGGING=basic). It points to most of the action happening before I even access the app, so I need to look into what the controller and feed are supposed to be doing.

Running command: prodigy review ds_gold ds --auto-accept
20:05:12: INIT: Setting all logging levels to 20
20:05:12: RECIPE: Calling recipe 'review'
20:05:12: RECIPE: Starting recipe review
20:05:12: CONFIG: Using config from global prodigy.json
20:05:12: CONFIG: Using config from working dir
20:05:12: DB: Initializing database PostgreSQL
20:05:12: DB: Connecting to database PostgreSQL
20:05:12: DB: Loading dataset 'ds' (1504 examples)
20:05:12: RECIPE: Merged 1498 examples from 1 datasets
20:05:12: CONFIG: Using config from global prodigy.json
20:05:12: CONFIG: Using config from working dir
20:05:12: VALIDATE: Validating components returned by recipe
20:05:12: CONTROLLER: Initialising from recipe
20:05:12: VALIDATE: Creating validator for view ID 'review'
20:05:12: VALIDATE: Validating Prodigy and recipe config
20:05:12: DB: Creating dataset '2021-10-10_20-05-12'
20:05:12: FEED: Initializing from controller
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:12: DB: Getting dataset 'ds_gold'
20:05:12: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: DB: Getting dataset 'ds_gold'
20:05:13: DB: Added 1 examples to 1 datasets
20:05:13: CORS: initialized with wildcard "*" CORS origins

:sparkles: Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

INFO: Started server process [63004]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8080 (Press CTRL+C to quit)
INFO: ::1:58039 - "GET / HTTP/1.1" 401 Unauthorized
INFO: ::1:58040 - "GET / HTTP/1.1" 200 OK
INFO: ::1:58040 - "GET /bundle.js HTTP/1.1" 200 OK
20:05:46: GET: /project
INFO: ::1:58040 - "GET /project HTTP/1.1" 200 OK
20:05:47: POST: /get_session_questions
20:05:47: CONTROLLER: Getting batch of questions for session: None
20:05:47: FEED: Finding next batch of questions in stream
20:05:47: FEED: re-adding open tasks to stream
20:05:47: FEED: Stream is empty
20:05:47: FEED: adding tasks from other sessions to None queue.
20:05:47: FEED: batch of questions requested for session None: 0
20:05:47: RESPONSE: /get_session_questions (0 examples)
INFO: ::1:58040 - "POST /get_session_questions HTTP/1.1" 200 OK

Hello,
I tried out the new release and am still seeing this issue. Are there any troubleshooting tips you might be able to offer?
Thanks as always!
Adam

Thanks for the detailed report, this is super strange :thinking: Especially since the auto-accepted examples are added in the stream, as examples are queued up for annotation. So it's no different from any other stream that does stuff within the generator.

Just to double-check, you don't have auto_count_stream set to True in your prodigy.json, do you?
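
One quick way to check is a small script that reads the config files and prints the setting. This is only a sketch: it assumes the global config lives under PRODIGY_HOME (default ~/.prodigy) and that a prodigy.json in the working directory may override it, which matches the two "Using config from" lines in the log. The helper name find_setting is made up for illustration.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Iterable


def find_setting(paths: Iterable[Path], key: str) -> Dict[str, Any]:
    """Map each existing config file to its value for `key` (None if unset)."""
    found = {}
    for path in paths:
        if path.is_file():
            with path.open(encoding="utf8") as f:
                found[str(path)] = json.load(f).get(key)
    return found


# Assumed locations: the global config under PRODIGY_HOME (default ~/.prodigy)
# and a prodigy.json in the current working directory.
home = Path(os.environ.get("PRODIGY_HOME", Path.home() / ".prodigy"))
candidates = [home / "prodigy.json", Path.cwd() / "prodigy.json"]
for path, value in find_setting(candidates, "auto_count_stream").items():
    print(path, "->", value)
```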

Another thing to check: You can call the get_stream helper in prodigy.recipes.review directly and then inspect the merged examples to make sure that they're what you expect:

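from prodigy.components.db import connect
from prodigy.recipes.review import get_stream

DB = connect()  # uses the database settings from your prodigy.json
input_sets = ["ds"]  # the dataset(s) you pass to the review recipe
view_id = "ner_manual"  # or whatever your view_id is
# Load each input dataset, keyed by its name
all_examples = {set_id: DB.get_dataset(set_id) for set_id in input_sets}
# Merge the annotations into one stream of examples with "versions"
stream = get_stream(all_examples, view_id)
stream = list(stream)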

Examples are not shown if eg["versions"] has a length of 1 (only one version of this example is available), and they are auto-added if eg["versions"][0]["sessions"] has a length greater than 1 (several sessions produced that same version).
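
To make that rule concrete, here is a minimal sketch that sorts merged examples into the three possible outcomes. It is not Prodigy's code, just the rule above written out; the classify helper and the session names are made up.

```python
from typing import Any, Dict


def classify(eg: Dict[str, Any]) -> str:
    """Say what review --auto-accept would do with one merged example,
    following the rule described above (a sketch, not Prodigy's code)."""
    versions = eg["versions"]
    if len(versions) > 1:
        return "review"  # conflicting versions: shown in the UI
    if len(versions[0]["sessions"]) > 1:
        return "auto-accept"  # identical answers from several sessions
    return "skip"  # annotated once: neither shown nor added


# Made-up merged examples; session names are hypothetical.
merged = [
    {"versions": [{"sessions": ["ds-a"]}, {"sessions": ["ds-b"]}]},
    {"versions": [{"sessions": ["ds-a", "ds-b"]}]},
    {"versions": [{"sessions": ["ds-a"]}]},
]
print([classify(eg) for eg in merged])  # ['review', 'auto-accept', 'skip']
```

If the recipe behaves as described, an example that was annotated only once lands in the third bucket, so it would neither show up in the UI nor be added to the gold dataset.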

Hi @ines,

I did have auto_count_stream set to true in my prodigy.json! I set it to false, but the problem persisted. Since the auto-accepting happens before I even access the UI, I figured I might as well restart the server a bunch of times to see if I could add all of the resolved examples on startup. I did this until the count of items added to the db far exceeded the number of non-duplicate annotations, which told me the review recipe wasn't checking against the db for already-existing annotations.

At this point, I decided to power through without --auto-accept, and after a speed run of "a" key tapping, things worked as expected: I got the correct number of resolved gold versions of annotations, and the review recipe reported no more examples to review once we hit that number. So if someone else hits this issue I'm happy to troubleshoot with them, but I'm going to return to the golden path for now! Thanks for your help.

Adam

Oh, this is a good point! The review recipe will exclude from the stream based on what's in the dataset (via Prodigy's default mechanism) but this happens after the stream is set up. And we're applying the auto-adding in the stream, so it definitely needs a check against the hashes in the database so you don't end up with duplicates here. I'll add this fix for the next release :+1:

I don't immediately see how this could be related to the issue here, but it's still good this came up. If you want to include this update in the meantime, you could open recipes/review.py in your Prodigy installation, find filter_auto_accept_stream and modify it like this:

def filter_auto_accept_stream(
    stream: Iterator[Dict[str, Any]], db: Database, dataset: str
) -> StreamType:
    """
    Automatically add examples with no conflicts to the database and skip
    them during annotation.
    """
    task_hashes = db.get_task_hashes(dataset)
    for eg in stream:
        versions = eg["versions"]
        if len(versions) == 1:  # no conflicts, only one version
            if eg[TASK_HASH_ATTR] in task_hashes:
                continue
            sessions = versions[0]["sessions"]
            if len(sessions) > 1:  # multiple identical versions
                # Add example to dataset automatically
                eg["answer"] = "accept"
                db.add_examples([eg], [dataset])
            # Don't send anything out for annotation
        else:
            yield eg
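
To see the effect of the hash check without touching a real database, here is a rough, self-contained sketch. FakeDB is a hypothetical in-memory stand-in for Prodigy's Database (only the two methods used here), the examples are made up, and the filter is copied from above with the type hints dropped. It simulates starting the server twice.

```python
from typing import Any, Dict, List

TASK_HASH_ATTR = "_task_hash"  # the key Prodigy uses for task hashes


class FakeDB:
    """Hypothetical in-memory stand-in for Prodigy's Database."""

    def __init__(self) -> None:
        self.datasets: Dict[str, List[Dict[str, Any]]] = {}

    def get_task_hashes(self, dataset: str) -> List[int]:
        return [eg[TASK_HASH_ATTR] for eg in self.datasets.get(dataset, [])]

    def add_examples(self, examples: List[Dict[str, Any]], datasets: List[str]) -> None:
        for name in datasets:
            self.datasets.setdefault(name, []).extend(examples)


def filter_auto_accept_stream(stream, db, dataset):
    # Same logic as the patched function above
    task_hashes = db.get_task_hashes(dataset)
    for eg in stream:
        versions = eg["versions"]
        if len(versions) == 1:
            if eg[TASK_HASH_ATTR] in task_hashes:
                continue  # already in the dataset, don't add it again
            if len(versions[0]["sessions"]) > 1:
                eg["answer"] = "accept"
                db.add_examples([eg], [dataset])
        else:
            yield eg


def make_stream() -> List[Dict[str, Any]]:
    # Made-up merged examples: one without conflicts, one with two versions
    return [
        {TASK_HASH_ATTR: 1, "versions": [{"sessions": ["a", "b"]}]},
        {TASK_HASH_ATTR: 2, "versions": [{"sessions": ["a"]}, {"sessions": ["b"]}]},
    ]


db = FakeDB()
for _ in range(2):  # simulate two server starts
    to_review = list(filter_auto_accept_stream(make_stream(), db, "ds_gold"))
print(len(db.datasets["ds_gold"]), len(to_review))  # 1 1
```

Without the hash check, the second pass would add the conflict-free example again, which is the duplication seen after repeated restarts.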

Glad to hear you found a solution to keep working! If you're able to share your data (existing dataset in the DB + source file), even just privately via email, let me know! Then we can try it out and see if we can reproduce it :slight_smile:

Thanks @ines! I'll email the team with a jsonl export of the dataset in question and steps to reproduce. I just installed with the newest spacy to confirm the issue persists, and indeed it does.

Best,

Adam

Hi @ines !

Is there any update on this issue?

I tried using the review recipe on a dataset with conflicting annotations.
With the --auto-accept option on, the only annotations that got added to my new dataset were the once-conflicting-now-resolved annotations. The only way I could get all of the original annotations (including the resolved annotations) in my new dataset was to run the review recipe again without --auto-accept and manually accept each annotation.

The filter function I included above was shipped with Prodigy v1.11.6. So maybe double-check that you have this version installed?

We're running version 1.11.7.
Is there something else that could be wrong?

This issue never resolved itself for me. Back when I was first having the issue, I sent the team my data to see if they could reproduce it. Happy to do so again if helpful.