Image classification (choice) - Duplicated images

In the "controller": after your recipe function has executed and returned its components, and before Prodigy starts up the annotation server.

Yes, absolutely. The entire task dictionary will be saved in the database, and you can get all existing annotations for a given dataset in the database. Let's say your examples look like this:

{"text": "Hello world", "meta": {"id": 123}}
{"text": "Blah blah", "meta": {"id": 456}}

When you annotate them, they'll be saved to the dataset. In your recipe, you can then call db.get_dataset to load them and get the meta.id field from each example. You now have a collection of IDs that you can compare the incoming examples against.

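from prodigy.components.db import connect

db = connect()
# `dataset` is the dataset name passed to your recipe
examples = db.get_dataset(dataset)
# Get the meta.id field for each example (a set makes the lookup below fast)
meta_ids = {eg["meta"]["id"] for eg in examples}

def filter_stream(stream):
    # Only yield incoming examples whose ID hasn't been annotated yet
    for eg in stream:
        if eg["meta"]["id"] not in meta_ids:
            yield eg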

If you can express it in Python, you can add pretty much any conditional logic here. It's probably not very useful, but you could even send an example out if its text is longer than X characters, or if it was annotated before but rejected and its ID is Y and some other custom meta property is Z. Or you could send a certain example out only if today is Monday or Tuesday :sweat_smile:
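Just to illustrate the idea, here's a rough sketch of what such a filter could look like. It doesn't need Prodigy to run. The names `MAX_CHARS`, `should_send` and `rejected_ids` are made up for this example, and it assumes each annotated example carries an "answer" key ("accept", "reject" or "ignore") next to the meta.id from above.

```python
import datetime

MAX_CHARS = 20  # hypothetical threshold, the "X characters" from above


def should_send(eg, rejected_ids, today=None):
    """Decide whether a single example should be sent out for annotation."""
    today = today or datetime.date.today()
    # Only send anything out on Monday (0) or Tuesday (1)
    if today.weekday() not in (0, 1):
        return False
    # Send examples with long texts
    if len(eg.get("text", "")) > MAX_CHARS:
        return True
    # Send examples that were annotated before but rejected
    return eg.get("meta", {}).get("id") in rejected_ids


def filter_stream(stream, rejected_ids, today=None):
    for eg in stream:
        if should_send(eg, rejected_ids, today):
            yield eg


# Stand-in for what you'd load from the database
annotated = [{"text": "Hello world", "meta": {"id": 123}, "answer": "reject"}]
rejected_ids = {eg["meta"]["id"] for eg in annotated if eg.get("answer") == "reject"}

stream = [
    {"text": "Hello world", "meta": {"id": 123}},
    {"text": "A much longer example text here", "meta": {"id": 456}},
    {"text": "Short", "meta": {"id": 789}},
]
```

In a real recipe, you'd build `rejected_ids` from the examples returned by the database and wrap your stream with `filter_stream(stream, rejected_ids)` before returning it. The `today` argument is only there so the weekday check is easy to try out with a fixed date.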
