get_orphan_example_ids hangs on Postgres

Hello!

After I upgraded to Prodigy 1.18, drop_dataset stopped working: it hangs inside get_orphan_example_ids.

It seems that the following query is taking forever:

SELECT "t1"."id" 
FROM "example" AS "t1" 
WHERE NOT ("t1"."id" IN (SELECT "t2"."example_id" FROM "link" AS "t2"))

Here’s the query plan:

Seq Scan on public.example t1  (cost=0.00..811368584.94 rows=94758 width=4)
  Output: t1.id
  Filter: (NOT (SubPlan 1))
  SubPlan 1
    ->  Materialize  (cost=0.00..7768.02 rows=317801 width=4)
          Output: t2.example_id
          ->  Seq Scan on public.link t2  (cost=0.00..4937.01 rows=317801 width=4)
                Output: t2.example_id
Query Identifier: -7075963186874869127

Any chance it can be replaced with NOT EXISTS?
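
Something like this is what I had in mind (just a sketch against the tables from the plan above, not tested through peewee):

-- Rough NOT EXISTS equivalent of the generated query above
SELECT t1.id
FROM example AS t1
WHERE NOT EXISTS (
    SELECT 1 FROM link AS t2 WHERE t2.example_id = t1.id
);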

For now, I am just setting USE_OLD_DROP to True.

Also, since you guys moved the database and other code out of Cython, is it on GitHub? I could open a PR if I figure out NOT EXISTS with peewee :slight_smile:

Hi @nrodnova!

Thanks for the report and suggested fix! It's true that using NOT IN with a subquery can be inefficient, especially on large tables.
I actually rewrote the query using a LEFT OUTER JOIN, as NOT EXISTS did not work as expected due to the Link table structure.
Could you check whether this version of the query solves the performance issue for you?

def get_orphan_example_ids(self) -> List[int]:
    """Return the ids of examples that are not linked to any dataset"""
    # Anti-join: keep example rows with no matching row in the link table
    query = (
        Example.select(Example.id)
        .join(Link, orm.JOIN.LEFT_OUTER, on=(Link.example_id == Example.id))
        .where(Link.example_id.is_null(True))
    )
    orphans = [eg.id for eg in query.execute()]
    st_query = (
        StructuredExampleModel.select(StructuredExampleModel.id)
        .join(
            StructuredLinkModel,
            orm.JOIN.LEFT_OUTER,
            on=(StructuredLinkModel.example_id == StructuredExampleModel.id),
        )
        .where(StructuredLinkModel.example_id.is_null(True))
    )
    orphans.extend(eg.id for eg in st_query.execute())
    return orphans

Another function that could use a similar refactor is delete_orphan_examples:

def delete_orphan_examples(self) -> None:
    """Delete examples that are not part of any dataset (session or otherwise)"""

    def delete_orphans(model, link_model, link_field):
        # Select rows of `model` that have no matching row in `link_model`
        query = (
            model.select(model.id)
            .join(link_model, orm.JOIN.LEFT_OUTER, on=(link_field == model.id))
            .where(link_field.is_null(True))
        )
        model.delete().where(model.id.in_(query)).execute()

    delete_orphans(Example, Link, Link.example_id)
    delete_orphans(StructuredExampleModel, StructuredLinkModel, StructuredLinkModel.example_id)

While we prepare the patch, feel free to edit your local copy of db.py to try this out. One of the reasons we distribute Prodigy as pure Python is to make it easier for license holders to debug, create custom integrations, and develop complex custom recipes.

I really appreciate your offer to open a PR; it's great to see that kind of engagement! At the same time, please keep in mind that Prodigy remains closed source under a commercial license: the code access we provide is intended for licensed users only, not to make the software fully public. Thank you!

Hi @nrodnova,

Just a quick update that a fix similar to what you suggested shipped in Prodigy 1.18.3.
Thanks again for your report!

Hey!!!

If get_orphan_example_ids is still hanging for anyone, it's usually because of how Postgres handles LEFT OUTER JOINs on big tables. Even with the null filter, the planner can decide to do a sequential scan if Link.example_id and StructuredLinkModel.example_id don't have the right indexes.
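
You can check what the planner actually picks for your data with EXPLAIN; the join version of the orphan query looks roughly like this (table names taken from the plan earlier in this thread):

-- Inspect the plan for the LEFT JOIN / IS NULL version of the orphan query
EXPLAIN (ANALYZE, VERBOSE)
SELECT example.id
FROM example
LEFT OUTER JOIN link ON link.example_id = example.id
WHERE link.example_id IS NULL;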

Adding an index to those columns usually cuts the time it takes to run from minutes to seconds.
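
On Postgres that's a one-off statement per column. The index names below are just placeholders, and I'm guessing the structured link table is called structured_link, so check your actual schema first:

-- Placeholder index names; the structured_link table name is a guess,
-- check your schema for the real one
CREATE INDEX IF NOT EXISTS link_example_id_idx
    ON link (example_id);
CREATE INDEX IF NOT EXISTS structured_link_example_id_idx
    ON structured_link (example_id);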

Also, be careful about pulling large id lists into Python first. Passing the subquery to Example.id.in_(...) keeps the work in the database, but materializing a huge list of orphan ids in Python and passing that list instead can use a lot of memory and produce a very large SQL statement.

In most cases, it's much faster to let Postgres delete things with a single SQL statement that includes the join than to get the ids in Python first.
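
As a sketch (untested, same table names as in the plan above), the cleanup for the main example/link pair can be a single statement:

-- Delete orphaned examples in one statement instead of fetching
-- their ids into Python first
DELETE FROM example
WHERE NOT EXISTS (
    SELECT 1 FROM link WHERE link.example_id = example.id
);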