Filter already annotated text


I built a custom recipe to do text classification according to a query. The dataset is a big CSV file. When I start a new session, I basically start from the beginning again. I tried to use filter_inputs to filter out the inputs saved in the last session, but I still have the same problem.

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import CSV
from prodigy.components.filters import filter_inputs

@prodigy.recipe(
    "semanticsearch",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a CSV file", "positional", None, str),
)
def semanticsearch(dataset: str, source: str):
    db = connect()
    input_hashes = db.get_input_hashes(dataset)

    stream = CSV(source)
    stream = filter_inputs(stream, input_hashes)

    blocks = [
        {"view_id": "html",
         "html_template": "<div style='background-color:SlateBlue;'><h1 style='color:White;'>{{label}}</h1></div>"},
        {"view_id": "html", "html_template": "<div>{{text}}</div>"},
    ]

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,   # Name of dataset to save annotations
        "stream": stream,     # Incoming stream of examples
        "config": {"blocks": blocks},
    }

Any help?


Hi! Is it possible that the hashes somehow change between the runs? I don't immediately see anything in your code that indicates that (like a timestamp in the main content or something like that). But it's definitely something to check.

The way you're using filter_inputs here won't work if your stream doesn't yet include hashes. So you either want to call prodigy.set_hashes on all your incoming examples before filtering, or write the logic out explicitly like this:

import prodigy

def filter_stream(stream, input_hashes):
    for eg in stream:
        eg = prodigy.set_hashes(eg)  # adds "_input_hash" and "_task_hash"
        if eg["_input_hash"] not in input_hashes:
            yield eg

stream = filter_stream(stream, input_hashes)

This also makes it easy to print the hashes if you need to, so you can double-check that they correctly reflect the "uniqueness" of the content.
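To see the idea in isolation, here's a minimal, self-contained sketch of input-hash filtering. It uses hashlib rather than Prodigy's internal hashing (so the hash values won't match Prodigy's, and the `input_hash` helper is purely illustrative), but the filtering logic is the same: hash only the input fields, then skip examples whose hash has already been seen.

```python
import hashlib
import json

def input_hash(eg, keys=("text",)):
    # Hash only the input fields, so the same text always maps to the
    # same hash, regardless of any annotations added to the example later.
    data = json.dumps({k: eg.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(data.encode("utf8")).hexdigest()

def filter_stream(stream, seen_hashes):
    # Yield only examples whose input hash has not been seen before.
    for eg in stream:
        if input_hash(eg) not in seen_hashes:
            yield eg

examples = [{"text": "first"}, {"text": "second"}, {"text": "first"}]
seen = {input_hash({"text": "first"})}  # pretend "first" was already annotated
remaining = [eg["text"] for eg in filter_stream(examples, seen)]
print(remaining)  # ["second"]
```

Note that both copies of "first" are filtered out: the one already in `seen` and the in-stream duplicate, because they produce the same input hash.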

Hi Ines!

It worked with no problems!

Thanks for the help :+1: