Deleting examples from DB

I have an iterative workflow process where I need to delete examples that were rejected. Is it possible to easily delete those examples from the DB or should I do the following?

  1. db-out
  2. filter on accepted
  3. drop and create dataset
  4. db-in
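Step 2 of that workflow (filter on accepted) can be sketched as a small standalone script. This is only a minimal sketch, assuming the db-out export is a JSONL file with one task per line and an "answer" key on each task; the `keep_accepted` and `filter_file` helpers are hypothetical names, not part of Prodigy.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def keep_accepted(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield only the tasks whose answer is "accept" (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        task = json.loads(line)
        # .get() treats tasks without an "answer" key as not accepted
        if task.get("answer") == "accept":
            yield task


def filter_file(src: Path, dst: Path) -> int:
    """Write the accepted tasks from src to dst and return how many were kept."""
    kept = 0
    with src.open(encoding="utf8") as f_in, dst.open("w", encoding="utf8") as f_out:
        for task in keep_accepted(f_in):
            f_out.write(json.dumps(task) + "\n")
            kept += 1
    return kept
```

Calling `filter_file` on wherever db-out wrote the export would produce the file to pass to db-in in step 4.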

Your approach seems fine, or you could write a custom recipe to do it:

recipe.py

from typing import List
import prodigy
from prodigy.components.db import Database, Dataset, Example, Link
from prodigy.util import log, print_stats


@prodigy.recipe("cleanup")
def cleanup(dataset: str):
    """Move the rejected examples of a dataset to the trash, then delete them."""
    DB: Database = prodigy.components.db.connect()
    if dataset not in DB:
        raise ValueError(f"dataset {dataset} does not exist!")
    dataset_id = Dataset.get(Dataset.name == dataset).id
    links: List[Link] = list(Link.select().where(Link.dataset == dataset_id))
    # Load each example once and keep the rejected ones. Using .get() means
    # examples without an "answer" key are skipped instead of raising KeyError.
    to_delete: List[Link] = []
    trash_examples = []
    for link in links:
        content = link.example.load()
        if content.get("answer") == "reject":
            to_delete.append(link)
            trash_examples.append(content)
    example_ids = [link.example.id for link in to_delete]
    link_ids = [link.id for link in to_delete]
    log(f"CLEANUP: Trashing {len(example_ids)} examples and {len(link_ids)} links")
    trash_file = DB.add_to_trash(trash_examples, dataset)
    log(f"CLEANUP: Examples moved to trash: {trash_file}")
    Link.delete().where(Link.id << link_ids).execute()
    Example.delete().where(Example.id << example_ids).execute()
    log("CLEANUP: Examples and links removed from database")
    print_stats(
        title="Trash rejected examples",
        no_format=False,
        stats={"Dataset": dataset, "Removed": len(example_ids), "Trash": trash_file},
    )

Try it

data.jsonl

{"text": "1", "label": "TEST", "answer": "reject"}
{"text": "2", "label": "TEST", "answer": "reject"}
{"text": "3", "label": "TEST"}

test_recipe.sh

#!/bin/bash
prodigy dataset test-dataset "test for removing rejected examples"
prodigy db-in test-dataset ./data.jsonl
prodigy cleanup test-dataset -F ./recipe.py
prodigy stats test-dataset

The upside of a custom recipe is that the examples can be added to the Prodigy trash before being removed, so you can recover them if need be:

  ✨  Trash rejected examples

Dataset   test-dataset                  
Removed   2                             
Trash     /yourpath/trash/test-dataset.jsonl

Ah yes of course. I keep forgetting to use recipes for more than just labelling tasks. Thanks @justindujardin

Hi,

@nix411, just curious: what is the reason you want to delete ‘rejected’ examples? I would think that a mixture of ‘accepts’ and ‘rejects’ is useful for the model’s learning.

Sure.

I am doing information extraction. Right now I am verifying whether the extracted information is correct.

  • accept: use the extracted information to create unit tests for my application.
  • reject: implement a fix. Rerun the classification on the failing ones.

I hope it makes sense - maybe there is a better workflow though!?
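
The accept/reject split described above could look roughly like this. It is only a sketch, assuming the annotations are available as dicts with "text", "label" and "answer" keys (as in the data.jsonl sample earlier in the thread); `split_by_answer` and `to_test_cases` are hypothetical helper names, not Prodigy functions.

```python
from typing import Dict, List, Tuple


def split_by_answer(examples: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (accepted, rejected); examples with any other answer are left out."""
    accepted = [eg for eg in examples if eg.get("answer") == "accept"]
    rejected = [eg for eg in examples if eg.get("answer") == "reject"]
    return accepted, rejected


def to_test_cases(accepted: List[Dict]) -> List[Tuple[str, str]]:
    """Turn accepted examples into (input text, expected label) pairs."""
    return [(eg["text"], eg["label"]) for eg in accepted]


# Illustrative data only, mirroring the shape of the sample tasks.
examples = [
    {"text": "1", "label": "TEST", "answer": "reject"},
    {"text": "2", "label": "TEST", "answer": "accept"},
]
accepted, rejected = split_by_answer(examples)
cases = to_test_cases(accepted)          # pairs to feed into the unit tests
rerun = [eg["text"] for eg in rejected]  # inputs to re-run once the fix is in
```

The pairs in `cases` could then be used to parametrize the application's unit tests, while `rerun` lists the inputs to classify again after the fix.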