Filter already annotated text

joaomsimoes · December 23, 2021, 12:15pm

Hello,

I built a custom recipe for text classification based on a query. The dataset is a big CSV file. Whenever I start a new session, annotation starts again from the beginning of the file. I tried using filter_inputs to skip the inputs already saved in the last session, but I still have the same problem.

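import prodigy
from prodigy.components.db import connect
from prodigy.components.filters import filter_inputs
from prodigy.components.loaders import CSV

@prodigy.recipe(
    "semanticsearch",
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a CSV file", "positional", None, str),
)
def semanticsearch(dataset: str, source: str):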

    db = connect()
    input_hashes = db.get_input_hashes(dataset)

    stream = CSV(source)
    stream = filter_inputs(stream, input_hashes)

    blocks = [
        {"view_id": "html",
         "html_template": "<div style='background-color:SlateBlue;'><h1 style='color:White;'>{{label}}</h1></div>"},
        {"view_id": "html", "html_template": "<div>{{text}}</div>"}
    ]

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {"blocks": blocks}
    }

Any help?

Best regards

ines · December 24, 2021, 1:09pm

Hi! Is it possible that the hashes somehow change between the runs? I don't immediately see anything in your code that indicates that (like a timestamp in the main content or something like that). But it's definitely something to check.

The way you're using filter_inputs here won't work if your stream doesn't yet include hashes. So you either want to call prodigy.set_hashes on all your incoming examples before filtering, or write out the logic explicitly like this:

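import prodigy

def filter_stream(stream, input_hashes):
    for eg in stream:
        # Add "_input_hash" and "_task_hash" to the example
        eg = prodigy.set_hashes(eg)
        # Only send out examples whose input is not in the dataset yet
        if eg["_input_hash"] not in input_hashes:
            yield eg

stream = filter_stream(stream, input_hashes)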

This also makes it easy to print the hashes if you need to, to double-check if they correctly reflect the "uniqueness" of the content.
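For anyone who wants to try the idea without a Prodigy install, here is a rough sketch of the same filter with the hashes printed along the way. Note that set_input_hash is a hypothetical stand-in written for this example: it is not how prodigy.set_hashes computes its hashes, it only mimics the idea of hashing the input fields into "_input_hash". The sample rows and the loaded_at field are also made up.

```python
import hashlib
import json


def set_input_hash(eg, input_keys=("text",)):
    # Hypothetical stand-in for prodigy.set_hashes (NOT Prodigy's real hashing):
    # hash only the chosen input fields and store the result as "_input_hash".
    payload = json.dumps({k: eg.get(k) for k in input_keys}, sort_keys=True)
    digest = hashlib.md5(payload.encode("utf8")).hexdigest()
    return {**eg, "_input_hash": digest}


def filter_stream(stream, input_hashes, input_keys=("text",)):
    for eg in stream:
        eg = set_input_hash(eg, input_keys)
        seen = eg["_input_hash"] in input_hashes
        # Print each hash so it is easy to check that it reflects the content
        print(eg["_input_hash"], "skip" if seen else "keep", repr(eg["text"][:40]))
        if not seen:
            yield eg


# Made-up rows standing in for the CSV file
rows = [
    {"text": "first row", "label": "A"},
    {"text": "second row", "label": "B"},
    {"text": "third row", "label": "C"},
]

# "Session 1": the first two rows were annotated and their hashes saved
annotated = {set_input_hash(r)["_input_hash"] for r in rows[:2]}

# "Session 2": only the row that was not annotated yet comes through
remaining = list(filter_stream(rows, annotated))

# If a changing field (like a timestamp) is part of the hashed content,
# the same text gets a different hash on every run and is never filtered out.
keys = ("text", "loaded_at")
run_1 = set_input_hash({"text": "first row", "loaded_at": "2021-12-23"}, keys)
run_2 = set_input_hash({"text": "first row", "loaded_at": "2021-12-24"}, keys)
hashes_differ = run_1["_input_hash"] != run_2["_input_hash"]
```

If the printed hashes for the same text differ between two runs, something in the hashed content is changing, and that is the first thing to look at.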

joaomsimoes · December 27, 2021, 7:02am

Hi Ines!

It worked with no problems!

Thanks for the help
