Exclude for custom_recipes - what am I missing?


I’ve read this thread and searched the forum without any luck. I want to exclude already annotated data when starting up a new annotation session, but I can’t get it to work. Here is my recipe:

This is my first ever recipe and I’ve only recently taken up programming in Python, so my guess is that I’m missing something very obvious.

When starting a new session I run prodigy emne my_database -F custom_recipe.py

import os

import jsonlines
import MySQLdb
import MySQLdb.cursors
import prodigy
import prodigy.components.loaders

@prodigy.recipe('emne',
                dataset=prodigy.recipe_args['dataset'])
def recipe(dataset):
    filename = 'emne.jsonl'
    stream = custom_stream(filename)
    return {
        'on_load': custom_loader(filename),
        'stream': stream,
        'dataset': dataset,
        'exclude': [dataset],
        'view_id': 'choice'
    }

def custom_stream(filename):
    "Create custom stream"
    options = [{'id': 1, 'text': 'Arbejdsmarked og beskæftigelse'},
               {'id': 2, 'text': 'Børn, unge og familie'},
               {'id': 3, 'text': 'Energi, forsyning og klima'},
               {'id': 4, 'text': 'EU'},
               {'id': 5, 'text': 'Finanssektoren'},
               {'id': 6, 'text': 'Flygtninge og asyl'},
               {'id': 7, 'text': 'Folkekirken og andre trossamfund'},
               {'id': 8, 'text': 'Forskning'},
               {'id': 9, 'text': 'Forsvar og militær'},
               {'id': 10, 'text': 'Integration'},
               {'id': 11, 'text': 'Internationalt samarbejde og handel'},
               {'id': 12, 'text': 'Kultur'},
               {'id': 13, 'text': 'Lokalforvaltning'},
               {'id': 14, 'text': 'Miljø og fødevarer'},
               {'id': 15, 'text': 'Miljø og natur'},
               {'id': 16, 'text': 'Offentlig forvaltning'},
               {'id': 17, 'text': 'Offentlige finanser'},
               {'id': 18, 'text': 'Retspolitik og justitsvæsen'},
               {'id': 19, 'text': 'Rigsfællesskabet'},
               {'id': 20, 'text': 'Skatter og afgifter'},
               {'id': 21, 'text': 'Sociale forhold'},
               {'id': 22, 'text': 'Sundhed'},
               {'id': 23, 'text': 'Teknologi og digitalisering'},
               {'id': 24, 'text': 'Transport og infrastruktur'},
               {'id': 25, 'text': 'Uddannelse'},
               {'id': 26, 'text': 'Udviklingsbistand og nødhjælp'},
               {'id': 27, 'text': 'Vækst og erhverv'}]
    stream = prodigy.components.loaders.JSONL(filename)
    for task in stream:
        task['options'] = options
        yield task

def custom_loader(filename):
    "Create jsonl file with data to annotate. This is only relevant for the first session"
    if not os.path.exists(filename):
        connection = MySQLdb.connect(host='host', user='user', passwd='passwd', port=3306, db='my_db')
        cursor = connection.cursor(MySQLdb.cursors.DictCursor)
        query = '''SELECT id, titel, resume, lovnummer FROM Sag WHERE typeid = 3'''
        cursor.execute(query)
        records = cursor.fetchall()
        with jsonlines.open(filename, mode='w') as writer:
            for row in records:
                writer.write({'text': ('TITLE:\n' + row['titel'] + '\n\nRESUME:\n' + row['resume']),
                              'meta': {'id': row['id'], 'lovnummer': row['lovnummer']}})

Hey! For a first recipe, this looks great! :+1: I also really like your custom loader solution, and I can’t see anything that looks suspicious. So just to confirm: You ran the recipe, annotated a few examples, saved them to your dataset, restarted the server, and the stream started again at the beginning, with examples you had already annotated in the previous session?

I just had a look at the exclude implementation in Prodigy and I think you might have hit an edge case where the filtering is applied before the hashes are assigned, meaning that the already annotated examples in your set aren’t mapped to the incoming examples correctly. (Sorry about that – I already fixed this for the next release!)

If you look at the already annotated examples…

prodigy db-out my_database | less

… you’ll see that each example in your set has two hashes assigned: an _input_hash (based on the input data) and a _task_hash (based on the input hash and other properties like options, spans etc.). This lets Prodigy find different questions about the same input data, as well as identical questions. The task hashes are used to filter out already annotated examples within the exclude logic.
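In plain Python, that filtering step amounts to something like the following sketch. (This is not Prodigy's actual implementation, and the hash values are made up for illustration; it just shows the idea that tasks whose _task_hash is already in the dataset get skipped.)

```python
def filter_seen_tasks(stream, annotated_task_hashes):
    """Yield only tasks whose _task_hash hasn't been annotated yet."""
    for task in stream:
        if task.get('_task_hash') not in annotated_task_hashes:
            yield task

# The dataset already contains an answer for the task with hash 111
seen = {111}
stream = [
    {'text': 'EU', '_input_hash': 1, '_task_hash': 111},
    {'text': 'Kultur', '_input_hash': 2, '_task_hash': 222},
]
remaining = list(filter_seen_tasks(stream, seen))
# Only the 'Kultur' task is left in the stream
```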

As a quick workaround, you could simply take care of the hashing within your custom recipe, to ensure that the hashes are always set before the filtering is applied:

from prodigy import set_hashes

# within your custom_stream function
for task in stream:
    task['options'] = options
    task = set_hashes(task)
    yield task

(You can find more details on the set_hashes function in your PRODIGY_README.html.)


Thanks Ines!

The workaround solution did indeed solve the issue. I’m not very familiar with hashing so I would never have thought of trying something like that.


Yay, glad to hear it’s working and sorry about the confusion! We’re just getting v1.4.1 ready, which will include a fix for this issue, so Prodigy should now handle the exclude logic without ever requiring pre-set hashes.

(And the hashing is definitely something you shouldn’t ever have to think about – unless you explicitly want to. But it might be good to know for the future – for example, if you ever want to export your annotations and quickly filter out all examples that refer to the same input text, but with potentially different options, spans or labels, you can simply check for the _input_hash.)
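For example, grouping exported annotations by their _input_hash only takes a few lines of plain Python (a sketch with made-up example dicts, but the _input_hash / _task_hash keys are the ones Prodigy exports):

```python
from collections import defaultdict

def group_by_input(examples):
    """Group annotation dicts by _input_hash, so all questions
    about the same input text end up together."""
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg['_input_hash']].append(eg)
    return dict(grouped)

# Two questions about the same input, one about a different input
examples = [
    {'text': 'Sundhed', '_input_hash': 10, '_task_hash': 101, 'accept': [22]},
    {'text': 'Sundhed', '_input_hash': 10, '_task_hash': 102, 'accept': []},
    {'text': 'Uddannelse', '_input_hash': 20, '_task_hash': 201, 'accept': [25]},
]
grouped = group_by_input(examples)
# grouped[10] holds both questions about 'Sundhed'
```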


I tried using this exclude functionality. For one recipe it does not do the desired job, as it starts the annotation from the beginning in the next session.
For the other, it shows no tasks available even though the database is empty.

The flow of my recipe is that the annotations from the first recipe are again used by the second recipe.

@akshitasood63 Moved your reply here, because this thread has more details on this. As I mentioned above, try setting the hashes of each task manually to ensure they’re available when Prodigy filters already annotated tasks from your stream. We’re also just working on Prodigy v1.4.1, which will include a fix for this.


Just released v1.4.1, which includes a fix for this issue. You should now be able to remove the set_hashes from your custom recipe and rely on Prodigy to handle this properly for you!


Hi @ines, I was using ner.make_gold as a template to write a custom NER correction recipe. Essentially it's the same, just with a custom UI component added. The make_tasks function assigns a new hash with set_hashes. exclude doesn't appear to work with this recipe, and I wonder: might set_hashes be the problem here? Would removing it make sense since the update in v1.4.1?

Another issue I have with this recipe is that it has a very long start-up time, around 15 minutes. When running Prodigy with logging turned on, it appears that tokenizing the examples takes 13 minutes. Is this to be expected? Totally fine if it is!

04:31:31: RECIPE: Loading recipe from file ../../chart_recipes.py
04:31:32: RECIPE: Calling recipe 'chart.ner.correct'
04:33:20: VALIDATE: Validating components returned by recipe
04:33:20: CONTROLLER: Initialising from recipe
04:33:20: VALIDATE: Creating validator for view ID 'blocks'
04:33:20: VALIDATE: Validating Prodigy and recipe config
04:33:20: DB: Initializing database SQLite
04:33:20: DB: Connecting to database SQLite
04:33:22: DB: Creating dataset '2020-07-29_04-33-20'
04:33:22: CONTROLLER: Initialising from recipe
04:33:22: CONTROLLER: Validating the first batch for session: None
04:33:22: PREPROCESS: Tokenizing examples
04:46:47: CORS: initialized with wildcard "*" CORS origins