I have an iterative workflow process where I need to delete examples that were rejected. Is it possible to easily delete those examples from the DB or should I do the following?
- db-out
- filter on accepted
- drop and create dataset
- db-in
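The "filter on accepted" step of this manual route could be sketched in Python as follows. This is only a minimal sketch: it assumes the dataset has been exported with db-out to a JSONL file in which each record carries an "answer" key, and the helper names keep_accepted and filter_file are hypothetical, not part of Prodigy.

```python
import json
from typing import Dict, Iterable, List


def keep_accepted(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines and keep only records whose answer is "accept"."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Records without an "answer" key are dropped as well.
        if record.get("answer") == "accept":
            kept.append(record)
    return kept


def filter_file(src_path: str, dst_path: str) -> int:
    """Write the accepted records from src_path to dst_path; return the count."""
    with open(src_path, encoding="utf-8") as src:
        kept = keep_accepted(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in kept:
            dst.write(json.dumps(record) + "\n")
    return len(kept)
```

The filtered file would then be loaded into the recreated dataset with db-in.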
Your approach seems fine, or you could write a custom recipe to do it:
recipe.py
from typing import List, Dict
import prodigy
from prodigy.components.db import Database, Dataset, Example, Link
from prodigy.util import log, print_stats


@prodigy.recipe("cleanup")
def remove_rejected_examples(dataset: str):
    DB: Database = prodigy.components.db.connect()
    if dataset not in DB:
        raise ValueError(f"dataset {dataset} does not exist!")
    dataset_id = Dataset.get(Dataset.name == dataset).id
    links = Link.select(Link.example).where(Link.dataset == dataset_id)
    to_delete: List[Link] = []
    invalid_link_ids: List[int] = []
    for link in links:
        try:
            content = link.example.load()
            if content["answer"] == "reject":
                to_delete.append(link)
        except Example.DoesNotExist:
            # If we find a broken link, remove it
            invalid_link_ids.append(link.id)
    # Grab ALL the links for the examples we want to remove, and
    # see how many references there are to each example. If there
    # are only two, we'll remove the example along with its links.
    links = Link.select().where(Link.example_id << [l.example.id for l in to_delete])
    link_counts: Dict[str, int] = {}
    for link in links:
        key = link.example_id
        if key not in link_counts:
            link_counts[key] = 0
        link_counts[key] += 1
    # If there are two or fewer links to this example it's okay to remove it.
    link_example_ids = [k for k, v in link_counts.items() if v <= 2]
    example_links = Link.select().where(Link.example_id << link_example_ids)
    all_links = [l.id for l in example_links] + invalid_link_ids
    Link.delete().where(Link.id << all_links).execute()
    to_delete_example_ids = list(set([l.example.id for l in example_links]))
    log(
        f"CLEANUP: Trashing {len(to_delete_example_ids)} examples and {len(all_links)} links"
    )
    print(to_delete_example_ids)
    to_delete_examples = Example.select().where(Example.id << to_delete_example_ids)
    trash_examples = [ex.load() for ex in to_delete_examples]
    trash_file = DB.add_to_trash(trash_examples, dataset)
    log(f"CLEANUP: Examples moved to trash: {trash_file}")
    Example.delete().where(Example.id << to_delete_example_ids).execute()
    log("CLEANUP: Examples and links removed from database")
    print_stats(
        title="Trash rejected examples",
        no_format=False,
        stats={
            "Dataset": dataset,
            "Removed": len(link_example_ids),
            "Trash": trash_file,
        },
    )
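The link-counting step of the recipe can be looked at in isolation. The sketch below is plain Python with no database involved; the (link_id, example_id) tuples and the function name removable_examples are hypothetical stand-ins for rows of the Link table. The threshold of two mirrors the recipe. The likely reason for it is that an example is normally linked once from the named dataset and once from a session dataset, but that is an assumption rather than something the recipe states.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def removable_examples(
    links: Iterable[Tuple[int, int]], max_links: int = 2
) -> List[int]:
    """Return the ids of examples referenced by at most max_links links."""
    # Count how many links point at each example id.
    counts = Counter(example_id for _link_id, example_id in links)
    # Examples with more links are treated as shared and left alone.
    return sorted(ex_id for ex_id, n in counts.items() if n <= max_links)
```

With links [(1, 10), (2, 10), (3, 11), (4, 11), (5, 11)], only example 10 qualifies, because example 11 is referenced three times.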
data.jsonl
{ "text": "1", "label": "TEST", "answer":"reject" }
{ "text": "2", "label": "TEST", "answer":"reject" }
{ "text": "3", "label": "TEST" }
test_recipe.sh
#!/bin/bash
prodigy dataset test-dataset "test for removing rejected examples"
prodigy db-in test-dataset ./data.jsonl
prodigy cleanup test-dataset -F ./recipe.py
prodigy stats test-dataset
The upside to a custom recipe is that the examples can be added to the prodigy trash before being removed, so you can recover them if need be:
✨ Trash rejected examples
Dataset test-dataset
Removed 2
Trash /yourpath/trash/test-dataset.jsonl
Ah yes of course. I keep forgetting to use recipes for more than just labelling tasks. Thanks @justindujardin
Hi,
@nix411, just curious: what is the reason you want to delete ‘rejected’ examples? I would have thought a mixture of ‘accepts’ and ‘rejects’ is useful for a model’s learning.
Sure.
I am doing information extraction, and I am now verifying whether the extracted information is correct.
- accept: use the extracted information to create unit tests for my application.
- reject: implement a fix. Rerun the classification on the failing ones.
I hope it makes sense - maybe there is a better workflow though!?
@justindujardin Hi Justin, I tried your script to modify the database in place and it works perfectly with the SQLite database. I'm now switching the default backend database to PostgreSQL, and I'm getting an error on this line:
Example.delete().where(Example.id << example_ids).execute()
peewee.IntegrityError: update or delete on table "example" violates foreign key constraint "link_example_id_fkey" on table "link"
DETAIL: Key (id)=(435954) is still referenced from table "link".
Any idea how to work around this?
Best regards
Hi @AlejandroJCR,
The problem is that PostgreSQL enforces foreign key constraints and SQLite does not (by default). I updated the snippet above to also look for and remove the session links for rejected examples. Can you try the updated version and confirm it works?
Hello Justin, thank you for your quick response and update, although I am still getting the same error. In case you ask: I connect to PostgreSQL in the standard way.
Okay, I reproduced the error (in SQLite by enabling ForeignKey constraints) and fixed it. Please try again
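The difference in foreign key enforcement can be reproduced outside Prodigy with Python's standard sqlite3 module. This is a minimal sketch: the two-table schema is a simplified, hypothetical stand-in for the real example and link tables, and the function name is made up for illustration.

```python
import sqlite3


def delete_referenced_example(enforce_foreign_keys: bool) -> str:
    """Try to delete an example row that a link row still references."""
    conn = sqlite3.connect(":memory:")
    # Foreign key enforcement is a per-connection setting in SQLite.
    pragma_value = "ON" if enforce_foreign_keys else "OFF"
    conn.execute(f"PRAGMA foreign_keys = {pragma_value}")
    conn.execute("CREATE TABLE example (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE link ("
        "id INTEGER PRIMARY KEY, "
        "example_id INTEGER NOT NULL REFERENCES example(id))"
    )
    conn.execute("INSERT INTO example (id) VALUES (1)")
    conn.execute("INSERT INTO link (id, example_id) VALUES (1, 1)")
    try:
        # The link row still points at this example.
        conn.execute("DELETE FROM example WHERE id = 1")
        return "deleted"
    except sqlite3.IntegrityError:
        return "integrity error"
    finally:
        conn.close()
```

Calling delete_referenced_example(False) returns "deleted", while delete_referenced_example(True) returns "integrity error". Deleting the link rows before the example rows satisfies the constraint in either mode.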
I tried it now and it works very well. Thank you very much, Justin.