Exclude for custom_recipes - what am I missing?


(Ronnie Taarnborg) #1


I’ve read this thread and searched the forum without any luck. I want to exclude already annotated data when starting up a new annotation session, but I can’t get it to work.

This is my first ever recipe and I’ve only just recently taken up programming in Python, so my guess is that I’m missing something very obvious.

When starting a new session I run prodigy emne my_database -F custom_recipe.py. Here is my recipe:

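import os
import jsonlines
import MySQLdb
import MySQLdb.cursors
import prodigy
import prodigy.components.loaders

@prodigy.recipe('emne',
                dataset=prodigy.recipe_args['dataset'])
def recipe(dataset):
    filename = 'emne.jsonl'
    stream = custom_stream(filename)
    return {
        # custom_loader is called immediately here, before the stream is consumed
        'on_load': custom_loader(filename),
        'stream': stream,
        'dataset': dataset,
        'exclude': [dataset],
        'view_id': 'choice'
    }
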
def custom_stream(filename):
    "Create custom stream"
    options = [{'id': 1, 'text': 'Arbejdsmarked og beskæftigelse'},
               {'id': 2, 'text': 'Børn, unge og familie'},
               {'id': 3, 'text': 'Energi, forsyning og klima'},
               {'id': 4, 'text': 'EU'},
               {'id': 5, 'text': 'Finanssektoren'},
               {'id': 6, 'text': 'Flygtninge og asyl'},
               {'id': 7, 'text': 'Folkekirken og andre trossamfund'},
               {'id': 8, 'text': 'Forskning'},
               {'id': 9, 'text': 'Forsvar og militær'},
               {'id': 10, 'text': 'Integration'},
               {'id': 11, 'text': 'Internationalt samarbejde og handel'},
               {'id': 12, 'text': 'Kultur'},
               {'id': 13, 'text': 'Lokalforvaltning'},
               {'id': 14, 'text': 'Miljø og fødevarer'},
               {'id': 15, 'text': 'Miljø og natur'},
               {'id': 16, 'text': 'Offentlig forvaltning'},
               {'id': 17, 'text': 'Offentlige finanser'},
               {'id': 18, 'text': 'Retspolitik og justitsvæsen'},
               {'id': 19, 'text': 'Rigsfællesskabet'},
               {'id': 20, 'text': 'Skatter og afgifter'},
               {'id': 21, 'text': 'Sociale forhold'},
               {'id': 22, 'text': 'Sundhed'},
               {'id': 23, 'text': 'Teknologi og digitalisering'},
               {'id': 24, 'text': 'Transport og infrastruktur'},
               {'id': 25, 'text': 'Uddannelse'},
               {'id': 26, 'text': 'Udviklingsbistand og nødhjælp'},
               {'id': 27, 'text': 'Vækst og erhverv'}]
    stream = prodigy.components.loaders.JSONL(filename)
    for task in stream:
        task['options'] = options
        yield task

def custom_loader(filename):
    "Create jsonl file with data to annotate. This is only relevant for the first session"
    # Only build the file if it doesn't exist yet, i.e. on the first session
    if not os.path.exists(filename):
        connection = MySQLdb.connect(host='host', user='user', passwd='passwd', port=3306, db='my_db')
        cursor = connection.cursor(MySQLdb.cursors.DictCursor)
        query = '''SELECT id, titel, resume, lovnummer from Sag WHERE typeid = 3'''
        cursor.execute(query)  # run the query before fetching the results
        records = cursor.fetchall()
        connection.close()
        with jsonlines.open(filename, mode='w') as writer:
            for row in records:
                text = 'TITLE:\n' + row['titel'] + '\n\nRESUME:\n' + row['resume']
                meta = {'id': row['id'], 'lovnummer': row['lovnummer']}
                writer.write({'text': text, 'meta': meta})

(Ines Montani) #2

Hey! For a first recipe, this looks great! :+1: I also really like your custom loader solution, and I can’t see anything that looks suspicious. So just to confirm: You ran the recipe, annotated a few examples, saved them to your dataset, restarted the server, and the stream started again at the beginning, with examples you had already annotated in the previous session?

I just had a look at the exclude implementation in Prodigy and I think you might have hit an edge case where the filtering is applied before the hashes are assigned, meaning that the already annotated examples in your set aren’t mapped to the incoming examples correctly. (Sorry about that – I already fixed this for the next release!)

If you look at the already annotated examples…

prodigy db-out my_database | less

… you’ll see that each example in your set has two hashes assigned: an _input_hash (based on the input data) and a _task_hash (based on the input hash and other properties like options, spans etc.). This lets Prodigy find different questions about the same input data, as well as identical questions. The task hashes are used to filter out already annotated examples within the exclude logic.
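
To make the relationship between the two hashes more concrete, here is a minimal sketch. It is not Prodigy's actual hashing code: toy_hash and toy_set_hashes are hypothetical helpers built on hashlib, and the choice of fields that feed each hash is an assumption based only on the description above.

```python
import hashlib
import json


def toy_hash(value):
    """Stable hash of any JSON-serializable value (illustration only)."""
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def toy_set_hashes(task):
    """Return a copy of the task with toy _input_hash and _task_hash values."""
    task = dict(task)
    # Input hash: depends only on the raw input data (here, the text).
    task["_input_hash"] = toy_hash(task.get("text"))
    # Task hash: depends on the input hash plus the question being asked.
    task["_task_hash"] = toy_hash(
        [task["_input_hash"], task.get("options"), task.get("spans")]
    )
    return task


# Same text, different options: same input, different question.
task_a = toy_set_hashes({"text": "TITLE:\nExample", "options": [{"id": 4, "text": "EU"}]})
task_b = toy_set_hashes({"text": "TITLE:\nExample", "options": [{"id": 12, "text": "Kultur"}]})
# Same text and same options as task_a: an identical question.
task_c = toy_set_hashes({"text": "TITLE:\nExample", "options": [{"id": 4, "text": "EU"}]})

print(task_a["_input_hash"] == task_b["_input_hash"])
print(task_a["_task_hash"] == task_b["_task_hash"])
print(task_a["_task_hash"] == task_c["_task_hash"])
```

In this sketch, task_a and task_b share an input hash but get different task hashes, while task_a and task_c are identical questions and share both.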

As a quick workaround, you could simply take care of the hashing within your custom recipe, to ensure that the hashes are always set before the filtering is applied:

from prodigy import set_hashes

# within your custom_stream function
for task in stream:
    task['options'] = options
    task = set_hashes(task)
    yield task

(You can find more details on the set_hashes function in your PRODIGY_README.html.)
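
For anyone curious why the order matters, the edge case can be sketched roughly like this. This is only an illustration and not how Prodigy implements exclude internally: filter_seen is a hypothetical helper, and the hash values are made up.

```python
def filter_seen(stream, seen_task_hashes):
    """Yield only tasks whose _task_hash is not in seen_task_hashes."""
    for task in stream:
        # A task without a hash yields None here, which never matches,
        # so unhashed tasks always pass through the filter.
        if task.get("_task_hash") in seen_task_hashes:
            continue
        yield task


# Task hashes of examples that are already in the dataset (made-up values).
seen = {101, 102}

# Hashes set before filtering: already annotated tasks are skipped.
hashed = [
    {"text": "a", "_task_hash": 101},
    {"text": "b", "_task_hash": 102},
    {"text": "c", "_task_hash": 103},
]
# Hashes not set yet: nothing can be matched, so nothing is skipped.
unhashed = [{"text": "a"}, {"text": "b"}, {"text": "c"}]

remaining_hashed = [t["text"] for t in filter_seen(hashed, seen)]
remaining_unhashed = [t["text"] for t in filter_seen(unhashed, seen)]
print(remaining_hashed)
print(remaining_unhashed)
```

When the hashes are missing at filtering time, every task looks new, which matches the symptom of the stream starting from the beginning again.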

(Ronnie Taarnborg) #3

Thanks Ines!

The workaround solution did indeed solve the issue. I’m not very familiar with hashing so I would never have thought of trying something like that.

(Ines Montani) #4

Yay, glad to hear it’s working and sorry about the confusion! We’re just getting v1.4.1 ready, which will include a fix for this issue, so Prodigy should now handle the exclude logic without ever requiring pre-set hashes.

(And the hashing is definitely something you shouldn’t ever have to think about – unless you explicitly want to. But it might be good to know for the future – for example, if you ever want to export your annotations and quickly filter out all examples that refer to the same input text, but with potentially different options, spans or labels, you can simply check for the _input_hash.)
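
As a rough sketch of that kind of filtering, assuming the annotations were exported as JSONL with one example per line and each example carries an _input_hash: group_by_input_hash is a hypothetical helper, and the sample lines below are made up for illustration.

```python
import json
from collections import defaultdict


def group_by_input_hash(lines):
    """Group exported examples by their _input_hash."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        example = json.loads(line)
        groups[example["_input_hash"]].append(example)
    return dict(groups)


# Made-up export: two questions about the same text, one about another text.
exported = [
    '{"text": "A", "_input_hash": 1, "_task_hash": 10, "answer": "accept"}',
    '{"text": "A", "_input_hash": 1, "_task_hash": 11, "answer": "reject"}',
    '{"text": "B", "_input_hash": 2, "_task_hash": 20, "answer": "accept"}',
]

groups = group_by_input_hash(exported)
# Inputs that were annotated more than once, possibly with different options.
repeated = {h: exs for h, exs in groups.items() if len(exs) > 1}
print(sorted(repeated))
```

With a real export you would pass the open file object instead of the exported list, since iterating over a file yields its lines.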

(Akshita Sood) #5

I tried using this exclude function. For one recipe it does not do the desired job: the annotations start again from the beginning in the next session.
Whereas for the other, it shows no tasks available even though the database is empty.

The flow of my recipes is that the annotations from the first recipe are used again by the second recipe.

(Ines Montani) #6

@akshitasood63 Moved your reply here, because this thread has more details on this. As I mentioned above, try setting the hashes of each task manually to ensure they’re available when Prodigy filters already annotated tasks from your stream. We’re also just working on Prodigy v1.4.1, which will include a fix for this.

(Ines Montani) #7

Just released v1.4.1, which includes a fix for this issue. You should now be able to remove the set_hashes from your custom recipe and rely on Prodigy to handle this properly for you!