auto_exclude_current not respected?

Hi,
we are using a custom recipe that continuously feeds database rows to our labeling tasks:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "postgres"
    }
  },
  "feed_overlap": false,
  "force_stream_order": true,
  "auto_exclude_current": false
}
import time
from typing import Any, Dict

import psycopg2
import prodigy
import spacy
from prodigy.components.preprocess import add_tokens


def get_unlabeled_items(query):
    conn = psycopg2.connect("")  # connection settings are picked up from the environment
    assert query

    try:
        while True:
            # Re-run the query on every pass so freshly inserted rows are picked up
            with conn:
                with conn.cursor() as curs:
                    curs.execute(query)
                    items = curs.fetchall()

            prodigy.util.msg.text(f"Queried {len(items)} items...")
            if len(items) == 0:
                time.sleep(10)  # back off before polling the database again
            for item in items:
                yield item[0]  # the first column of each row holds the task dict
    except Exception:
        conn.close()
        raise


@prodigy.recipe("custom.ner")
def unbind_ner_label() -> Dict[str, Any]:
    stream = get_unlabeled_items("select * from api.get_ner_tasks();")
    nlp = spacy.blank("en")

    stream = add_tokens(nlp, stream, use_chars=False)

    return {
        "view_id": "ner_manual",
        "dataset": "ner",
        "stream": stream,
        "config": {"labels": ["brand", "quantity"]},
    }

Question 1: Even though "auto_exclude_current" is set to false, recurring items are still deduplicated and therefore "hang" in the stream, even though we want to label them twice.

Question 2: If the stream has 0 items, Prodigy shows a "Loading..." text instead of the "No tasks available." message that appears when the generator is exhausted. It's more of a cosmetic issue, but is it possible to recreate that behaviour while keeping our never-ending generator implementation?

btw: the documentation states that feed_overlap defaults to false, while in the code the actual default is true?

Best,
Roman

Hi! The auto_exclude_current setting refers to excluding annotations that are already present in the current dataset – is this your main goal, or do you also have to deal with duplicates within the same stream and session? "auto_exclude_current": true should exclude all task hashes that are already present in the current dataset that the annotations are saved to – so if your stream produces the same task hash again and it's in the dataset, it should still be shown if you set "auto_exclude_current": false (if not, that'd definitely be confusing).
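For reference, the deduplication and exclusion logic compares the hashes Prodigy assigns to each task: two examples with identical content receive the same _task_hash. A minimal sketch using the set_hashes helper (the example text is made up):

import prodigy

# Structurally identical tasks end up with identical hashes: this is
# what both deduplication and auto_exclude_current compare against.
eg1 = prodigy.set_hashes({"text": "2 bottles of ACME cola"})
eg2 = prodigy.set_hashes({"text": "2 bottles of ACME cola"})
assert eg1["_input_hash"] == eg2["_input_hash"]
assert eg1["_task_hash"] == eg2["_task_hash"]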

If your goal is to label the same example multiple times and have your own definition of what's a duplicate, one option you could consider is using custom hashes that express this: so every example that goes out and that you want to present for annotation receives a unique _task_hash, even if its contents are the same as something you've previously annotated. This gives you a lot of flexibility, and it might also make some of the interactions easier to control and reason about.
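Here's a minimal sketch of that idea (make_tasks_unique is a made-up helper name, and a simple counter stands in for whatever uniqueness scheme fits your setup):

from itertools import count

import prodigy

def make_tasks_unique(stream):
    # Stamp every outgoing example with its own _task_hash, so a repeated
    # item is never filtered out as a duplicate of an earlier annotation.
    counter = count()
    for eg in stream:
        eg = prodigy.set_hashes(eg)       # sets _input_hash and _task_hash
        eg["_task_hash"] = next(counter)  # override with a per-emission hash
        yield eg

In your recipe, you'd then wrap the stream: stream = make_tasks_unique(get_unlabeled_items(...)). Keep in mind that the counter restarts at 0 whenever the process restarts, so pick a scheme that won't collide with hashes you still want to exclude.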

I need to double-check this, but it might just be a side-effect of the length-0 stream causing the back-end to raise an error, so Prodigy stays in "Loading...", since it never gets a response to its request.

We definitely want to add more fine-grained control over this, including blocking streams, in the future. In the meantime, one workaround would be to ensure that your generator is never exhausted by keeping it stuck in a loop, or to keep sending out a "dummy" task (e.g. with a hash that's excluded) to keep it busy while you're queueing up more examples in the background. You could also experiment with different batch sizes: Prodigy will ask for new examples if there's less than half a batch left in the queue (or fewer than 2 examples total) – so with a slightly larger batch size, you'd have more time to queue up new examples in the background.
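For the batch size, you can set batch_size in the config your recipe returns (50 below is just an example value, not a recommendation):

    return {
        "view_id": "ner_manual",
        "dataset": "ner",
        "stream": stream,
        "config": {"labels": ["brand", "quantity"], "batch_size": 50},
    }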

The setting used to default to true; v1.10.5 changed it to false (as this is the more reasonable default behaviour), so depending on which version you're running, the docs and the code can disagree here.