Deleting examples from DB

nix411 · July 3, 2019, 2:43pm

I have an iterative workflow process where I need to delete examples that were rejected. Is it possible to easily delete those examples from the DB or should I do the following?

db-out
filter on accepted
drop and create dataset
db-in
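
The "filter on accepted" step could look roughly like the sketch below. It assumes the db-out export is a JSONL file with one example per line and an "answer" field on each annotated example; the function names and file paths are hypothetical and not part of Prodigy.

```python
import json
from typing import Iterable, List


def keep_accepted(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines and keep only examples whose answer is "accept"."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        if example.get("answer") == "accept":
            kept.append(example)
    return kept


def filter_file(src: str, dst: str) -> int:
    """Write the accepted examples from src to dst and return how many were kept."""
    with open(src, encoding="utf8") as f:
        kept = keep_accepted(f)
    with open(dst, "w", encoding="utf8") as f:
        for example in kept:
            f.write(json.dumps(example) + "\n")
    return len(kept)
```

Calling filter_file("exported.jsonl", "accepted.jsonl") between the db-out and db-in steps would leave only the accepted examples to re-import.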

justindujardin · July 3, 2019, 6:41pm

Your approach seems fine, or you could write a custom recipe to do it:

recipe.py

from typing import List, Dict
import prodigy
from prodigy.components.db import Database, Dataset, Example, Link
from prodigy.util import log, print_stats


@prodigy.recipe("cleanup")
def remove_rejected_examples(dataset: str):
    DB: Database = prodigy.components.db.connect()
    if dataset not in DB:
        raise ValueError(f"dataset {dataset} does not exist!")
    dataset_id = Dataset.get(Dataset.name == dataset).id
    # Select the link id too, so broken links can be deleted by id below
    links = Link.select(Link.id, Link.example).where(Link.dataset == dataset_id)
    to_delete: List[Link] = []
    invalid_link_ids: List[int] = []
    for link in links:
        try:
            content = link.example.load()
            if content["answer"] == "reject":
                to_delete.append(link)
        except Example.DoesNotExist:
            # If we find a broken link, remove it
            invalid_link_ids.append(link.id)

    # Grab ALL the links for the examples we want to remove, and
    # count how many references there are to each example. If there
    # are two or fewer, we'll remove the example along with its links.
    links = Link.select().where(Link.example_id << [l.example.id for l in to_delete])
    # Maps example id -> number of links that reference it
    link_counts: Dict[int, int] = {}
    for link in links:
        key = link.example_id
        if key not in link_counts:
            link_counts[key] = 0
        link_counts[key] += 1
    # If there are two or fewer links to this example it's okay to remove it.
    link_example_ids = [k for k, v in link_counts.items() if v <= 2]
    example_links = Link.select().where(Link.example_id << link_example_ids)
    all_links = [l.id for l in example_links] + invalid_link_ids
    Link.delete().where(Link.id << all_links).execute()
    to_delete_example_ids = list(set([l.example.id for l in example_links]))
    log(
        f"CLEANUP: Trashing {len(to_delete_example_ids)} examples and {len(all_links)} links"
    )
    print(to_delete_example_ids)
    to_delete_examples = Example.select().where(Example.id << to_delete_example_ids)
    trash_examples = [ex.load() for ex in to_delete_examples]
    trash_file = DB.add_to_trash(trash_examples, dataset)
    log(f"CLEANUP: Examples moved to trash: {trash_file}")
    Example.delete().where(Example.id << to_delete_example_ids).execute()
    log("CLEANUP: Examples and links removed from database")
    print_stats(
        title="Trash rejected examples",
        no_format=False,
        stats={
            "Dataset": dataset,
            "Removed": len(link_example_ids),
            "Trash": trash_file,
        },
    )
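
The reference-counting step in the recipe can be illustrated without a database. The following is a minimal sketch that uses plain (link_id, example_id) tuples in place of Link rows; the helper name and the sample ids are hypothetical. The threshold of two mirrors the recipe, and the reason for it (probably one link from the dataset and one from an annotation session) is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def removable_example_ids(
    links: Iterable[Tuple[int, int]], max_links: int = 2
) -> List[int]:
    """Return ids of examples referenced by at most max_links links.

    Each item in links is a (link_id, example_id) pair.
    """
    counts = Counter(example_id for _, example_id in links)
    return sorted(eid for eid, n in counts.items() if n <= max_links)


# Example 10 has two links, example 11 has three, example 12 has one.
sample_links = [(1, 10), (2, 10), (3, 11), (4, 11), (5, 11), (6, 12)]
removable = removable_example_ids(sample_links)
```

Examples with more links than the threshold are left alone, since another dataset may still reference them.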

Try it

data.jsonl

{ "text": "1", "label": "TEST", "answer": "reject" }
{ "text": "2", "label": "TEST", "answer": "reject" }
{ "text": "3", "label": "TEST" }

test_recipe.sh

#!/bin/bash
prodigy dataset test-dataset "test for removing rejected examples"
prodigy db-in test-dataset ./data.jsonl
prodigy cleanup test-dataset -F ./recipe.py
prodigy stats test-dataset

The upside of a custom recipe is that the examples can be added to the Prodigy trash before being removed, so you can recover them if need be:

  ✨  Trash rejected examples

Dataset   test-dataset                  
Removed   2                             
Trash     /yourpath/trash/test-dataset.jsonl

nix411 · July 4, 2019, 7:36am

Ah yes of course. I keep forgetting to use recipes for more than just labelling tasks. Thanks @justindujardin

jsnleong · July 4, 2019, 9:54am

Hi,

@nix411, just curious, what is the reason you want to delete ‘rejected’ examples? I would have thought that a mixture of ‘accepts’ and ‘rejects’ is useful for the model’s learning.

nix411 · July 4, 2019, 9:57am

Sure.

I am doing information extraction. Now I am verifying whether the extracted information is correct.

accept: use the extracted information to create unit tests for my application.
reject: implement a fix. Rerun the classification on the failing ones.

I hope it makes sense - maybe there is a better workflow though!?

AlejandroJCR · October 9, 2019, 1:10pm

@justindujardin Hi Justin, I tried your script to modify the database in place and it works perfectly with the SQLite database. I'm now switching the default backend database to PostgreSQL, and I'm getting an error on this line:

Example.delete().where(Example.id << example_ids).execute()

peewee.IntegrityError: update or delete on table "example" violates foreign key constraint "link_example_id_fkey" on table "link"
DETAIL: Key (id)=(435954) is still referenced from table "link".

Any idea how to work around this?
Best regards

justindujardin · October 10, 2019, 5:01pm

Hi @AlejandroJCR,

The problem is that PostgreSQL enforces foreign key constraints and SQLite does not (by default). I updated the snippet above to look for and remove links for rejected example sessions as well. Can you try the updated version and confirm it works?
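
The difference in foreign key enforcement can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch with a simplified, hypothetical two-table schema (not Prodigy's actual tables): it tries to delete an example row that a link row still references, with enforcement switched off and then on.

```python
import sqlite3


def delete_referenced_example(enforce_foreign_keys: bool) -> str:
    """Try to delete an example row that is still referenced by a link row."""
    conn = sqlite3.connect(":memory:")
    # Set the pragma explicitly so the result does not depend on build defaults.
    pragma_value = "ON" if enforce_foreign_keys else "OFF"
    conn.execute(f"PRAGMA foreign_keys = {pragma_value}")
    conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE link ("
        "id INTEGER PRIMARY KEY, "
        "example_id INTEGER NOT NULL REFERENCES example(id))"
    )
    conn.execute("INSERT INTO example (id) VALUES (1)")
    conn.execute("INSERT INTO link (id, example_id) VALUES (1, 1)")
    try:
        conn.execute("DELETE FROM example WHERE id = 1")
        return "deleted"
    except sqlite3.IntegrityError as err:
        return f"blocked: {err}"
    finally:
        conn.close()


without_fk = delete_referenced_example(False)
with_fk = delete_referenced_example(True)
```

With enforcement off the delete goes through and leaves a dangling link; with enforcement on it raises an IntegrityError, which is the same class of failure PostgreSQL reports. Deleting the referencing link rows first avoids the error in both cases.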

AlejandroJCR · October 11, 2019, 8:01am

Hello Justin, thank you for your quick response and update. I am still getting the same error, though. In case you were going to ask: I connect to PostgreSQL in the standard way.

justindujardin · October 12, 2019, 4:55pm

Okay, I reproduced the error (in SQLite, by enabling foreign key constraints) and fixed it. Please try again.

AlejandroJCR · October 14, 2019, 8:44am

I tried it now and it works very well. Thank you very much, Justin.
