Thanks for the report and suggested fix! It's true that using NOT IN with a subquery can be inefficient, especially on large tables.
I've actually rewritten the query using a LEFT OUTER JOIN, as NOT EXISTS did not work as expected due to the Link table structure.
Could you check if this version of the query solves the performance issue for you?
Another function that could use a similar refactor is delete_orphan_examples:
```python
def delete_orphan_examples(self) -> None:
    """Delete examples that are not part of any dataset (session or otherwise)."""

    def delete_orphans(model, link_model) -> None:
        # Anti-join: select examples with no matching row in the link table.
        orphan_ids = (
            model.select(model.id)
            .join(link_model, orm.JOIN.LEFT_OUTER, on=(link_model.example_id == model.id))
            .where(link_model.example_id.is_null(True))
        )
        model.delete().where(model.id.in_(orphan_ids)).execute()

    delete_orphans(Example, Link)
    delete_orphans(StructuredExampleModel, StructuredLinkModel)
```
While we prepare the patch, feel free to try this out by editing your local copy of db.py. One of the reasons we distribute Prodigy as pure Python is to make it easier for license holders to debug, create custom integrations, or develop complex custom recipes.

I really appreciate your initiative in opening a PR—it's great to see community engagement! At the same time, please keep in mind that Prodigy remains closed source under a commercial license; the code access we provide is intended solely for licensed users, not to make the software fully public. Thank you!
If get_orphan_example_ids is still hanging for anyone, it's usually because of how Postgres handles LEFT OUTER JOINs on large tables. Even with the null filter, the planner can fall back to a sequential scan if Link.example_id and StructuredLinkModel.example_id are not indexed.
Adding an index to those columns usually cuts the runtime from minutes to seconds.
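To see the effect of that index without touching a real Prodigy database, here is a minimal sketch using sqlite3 with simplified stand-in tables (the table and column names are illustrative, not Prodigy's actual schema). It inspects the query plan for the anti-join before and after creating an index on the link table's example_id column; the same CREATE INDEX statement works on Postgres.

```python
import sqlite3

# Stand-in tables: "example" and "link" are simplified, not Prodigy's real schema.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE example (id INTEGER PRIMARY KEY);
    CREATE TABLE link (example_id INTEGER, dataset_id INTEGER);
    """
)

anti_join = (
    "SELECT example.id FROM example "
    "LEFT OUTER JOIN link ON link.example_id = example.id "
    "WHERE link.example_id IS NULL"
)

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); detail is the text.
before = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + anti_join)]

con.execute("CREATE INDEX idx_link_example_id ON link (example_id)")
after = [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + anti_join)]

print(before)  # without the index, the link table is scanned sequentially
print(after)   # with the index, the plan searches link via idx_link_example_id
```

On Postgres you would check the same thing with EXPLAIN instead of EXPLAIN QUERY PLAN.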
Also, be careful about materializing query results in Python before reusing them: fetching a huge list of orphan ids and passing it back into Example.id.in_(...) can use up a lot of memory, whereas keeping it as a subquery stays in the database.
In most cases, it's much faster to let Postgres delete things with a single SQL statement that includes the join than to get the ids in Python first.
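As a sketch of that single-statement approach, here is a runnable example using sqlite3 with simplified stand-in tables (again, illustrative names, not Prodigy's actual schema). The whole orphan cleanup runs as one DELETE with a correlated NOT EXISTS, so no id list is ever pulled into Python:

```python
import sqlite3

# Stand-in schema: one linked example (id 1) and two orphans (ids 2 and 3).
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE example (id INTEGER PRIMARY KEY);
    CREATE TABLE link (example_id INTEGER);
    INSERT INTO example (id) VALUES (1), (2), (3);
    INSERT INTO link (example_id) VALUES (1);
    """
)

# One statement: the database resolves the anti-join and deletes the orphans.
con.execute(
    "DELETE FROM example WHERE NOT EXISTS "
    "(SELECT 1 FROM link WHERE link.example_id = example.id)"
)

remaining = [row[0] for row in con.execute("SELECT id FROM example")]
print(remaining)  # -> [1]
```

The same NOT EXISTS form works on Postgres, where the planner can use the index on link.example_id for the correlated subquery.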