Exclude for custom_recipes - what am I missing?

ronnie · March 13, 2018, 10:35am

Hi.

I’ve read this thread and searched the forum without any luck. I want to exclude already annotated data when starting up a new annotation session, but I can’t get it to work. Here is my recipe:

This is my first ever recipe and I’ve only just recently taken up programming in Python, so my guess is that I’m missing something very obvious.

When starting a new session I run prodigy emne my_database -F custom_recipe.py

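import os

import jsonlines
import MySQLdb
import MySQLdb.cursors
import prodigy
import prodigy.components.loaders

@prodigy.recipe('emne',
                dataset = prodigy.recipe_args['dataset'])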
def recipe(dataset):
    ''''''
    filename = 'emne.jsonl'
    stream = custom_stream(filename)
    return {
        'on_load': custom_loader(filename),
        'stream': stream,
        'dataset': dataset,
        'exclude': [dataset],
        'view_id': 'choice'
    }

def custom_stream(filename):
    "Create custom stream"
    options = [{'id': 1, 'text': 'Arbejdsmarked og beskæftigelse'},
               {'id': 2, 'text': 'Børn, unge og familie'},
               {'id': 3, 'text': 'Energi, forsyning og klima'},
               {'id': 4, 'text': 'EU'},
               {'id': 5, 'text': 'Finanssektoren'},
               {'id': 6, 'text': 'Flygtninge og asyl'},
               {'id': 7, 'text': 'Folkekirken og andre trossamfund'},
               {'id': 8, 'text': 'Forskning'},
               {'id': 9, 'text': 'Forsvar og militær'},
               {'id': 10, 'text': 'Integration'},
               {'id': 11, 'text': 'Internationalt samarbejde og handel'},
               {'id': 12, 'text': 'Kultur'},
               {'id': 13, 'text': 'Lokalforvaltning'},
               {'id': 14, 'text': 'Miljø og fødevarer'},
               {'id': 15, 'text': 'Miljø og natur'},
               {'id': 16, 'text': 'Offentlig forvaltning'},
               {'id': 17, 'text': 'Offentlige finanser'},
               {'id': 18, 'text': 'Retspolitik og justitsvæsen'},
               {'id': 19, 'text': 'Rigsfællesskabet'},
               {'id': 20, 'text': 'Skatter og afgifter'},
               {'id': 21, 'text': 'Sociale forhold'},
               {'id': 22, 'text': 'Sundhed'},
               {'id': 23, 'text': 'Teknologi og digitalisering'},
               {'id': 24, 'text': 'Transport og infrastruktur'},
               {'id': 25, 'text': 'Uddannelse'},
               {'id': 26, 'text': 'Udviklingsbistand og nødhjælp'},
               {'id': 27, 'text': 'Vækst og erhverv'}]
    stream = prodigy.components.loaders.JSONL(filename)
    for task in stream:
        task['options'] = options
        yield task

def custom_loader(filename):
    "Create jsonl file with data to annotate. This is only relevant for the first session"
    if os.path.exists(filename):
        pass
    else:
        connection = MySQLdb.connect(host='host',user='user',passwd='passwd',port=3306,db='my_db')
        connection.set_character_set('utf8mb4')
        cursor = connection.cursor(MySQLdb.cursors.DictCursor)
        query = '''SELECT id, titel, resume, lovnummer from Sag WHERE typeid = 3'''
        cursor.execute(query)
        records = cursor.fetchall()
        with jsonlines.open(filename, mode='w') as writer:
            for row in records:
                writer.write({'text': ('TITLE:\n'+row['titel']+'\n\nRESUME:\n'+row['resume']), 'meta':{'id':row['id'], 'lovnummer':row['lovnummer']}})
        connection.close()

ines · March 13, 2018, 11:16am

Hey! For a first recipe, this looks great! I also really like your custom loader solution, and I can’t see anything that looks suspicious. So just to confirm: You ran the recipe, annotated a few examples, saved them to your dataset, restarted the server, and the stream started again at the beginning, with examples you had already annotated in the previous session?

I just had a look at the exclude implementation in Prodigy and I think you might have hit an edge case where the filtering is applied before the hashes are assigned, meaning that the already annotated examples in your set aren’t mapped to the incoming examples correctly. (Sorry about that – I already fixed this for the next release!)

If you look at the already annotated examples…

prodigy db-out my_database | less

… you’ll see that each example in your set has two hashes assigned: an _input_hash (based on the input data) and a _task_hash (based on the input hash and other properties like options, spans etc.). This lets Prodigy find different questions about the same input data, as well as identical questions. The task hashes are used to filter out already annotated examples within the exclude logic.
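
To make the two hashes a bit more concrete, here's a toy sketch. It is not Prodigy's actual hashing code: toy_hash and toy_set_hashes are made-up helpers, and the choice of which keys feed each hash is an assumption for illustration only. It just shows the idea that the same input text with different options shares an input hash but gets a different task hash.

```python
import hashlib
import json


def toy_hash(obj):
    """Stable toy hash of a JSON-serialisable object (NOT Prodigy's algorithm)."""
    data = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf8")
    return hashlib.md5(data).hexdigest()


def toy_set_hashes(task):
    """Return a copy of the task with illustrative input and task hashes."""
    task = dict(task)
    # Input hash: depends only on the raw input (here, just the text).
    task["_input_hash"] = toy_hash({"text": task.get("text")})
    # Task hash: depends on the input hash plus the question being asked.
    task["_task_hash"] = toy_hash({
        "input": task["_input_hash"],
        "options": task.get("options"),
        "spans": task.get("spans"),
    })
    return task


a = toy_set_hashes({"text": "TITLE: ...", "options": [{"id": 4, "text": "EU"}]})
b = toy_set_hashes({"text": "TITLE: ...", "options": [{"id": 12, "text": "Kultur"}]})

# Same input text, so the input hashes match.
print(a["_input_hash"] == b["_input_hash"])
# Different options, so the task hashes differ.
print(a["_task_hash"] == b["_task_hash"])
```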

As a quick workaround, you could simply take care of the hashing within your custom recipe, to ensure that the hashes are always set before the filtering is applied:

from prodigy import set_hashes

# within your custom_stream function
for task in stream:
    task['options'] = options
    task = set_hashes(task)
    yield task

(You can find more details on the set_hashes function in your PRODIGY_README.html.)
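
Here's a rough sketch of why the order matters. It doesn't use Prodigy at all: exclude_seen and add_hashes are hypothetical stand-ins, and the text is used as a placeholder for a real task hash. If the filter runs before the hashes exist, no incoming task can match an already annotated one, so nothing gets excluded.

```python
def exclude_seen(stream, seen_task_hashes):
    """Toy exclude filter: drop tasks whose _task_hash was already annotated."""
    for task in stream:
        if task.get("_task_hash") in seen_task_hashes:
            continue
        yield task


def add_hashes(stream):
    """Hypothetical stand-in for calling set_hashes on every task."""
    for task in stream:
        task = dict(task)
        # Placeholder: the text itself stands in for a real task hash.
        task["_task_hash"] = task["text"]
        yield task


examples = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
seen = {"a", "b"}  # task hashes already stored in the dataset

# Filtering before the hashes exist: nothing matches, nothing is excluded.
too_early = list(add_hashes(exclude_seen(examples, seen)))

# Hashing first, then filtering: the already annotated tasks are dropped.
in_order = list(exclude_seen(add_hashes(examples), seen))

print(len(too_early), len(in_order))
```

Setting the hashes inside custom_stream, as in the workaround above, corresponds to the second ordering.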

ronnie · March 13, 2018, 12:17pm

Thanks Ines!

The workaround solution did indeed solve the issue. I’m not very familiar with hashing so I would never have thought of trying something like that.

ines · March 13, 2018, 12:27pm

Yay, glad to hear it’s working and sorry about the confusion! We’re just getting v1.4.1 ready, which will include a fix for this issue, so Prodigy should now handle the exclude logic without ever requiring pre-set hashes.

(And the hashing is definitely something you shouldn’t ever have to think about – unless you explicitly want to. But it might be good to know for the future – for example, if you ever want to export your annotations and quickly filter out all examples that refer to the same input text, but with potentially different options, spans or labels, you can simply check for the _input_hash.)
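
As a small sketch of that last idea: assuming you've exported your annotations as JSONL (for example with db-out), you could group them by _input_hash roughly like this. The three example records and their hash values are made up for illustration.

```python
import json
from collections import defaultdict

# Made-up stand-in for the JSONL lines of an export such as `prodigy db-out`.
exported = """\
{"text": "A", "_input_hash": 1, "_task_hash": 10, "answer": "accept"}
{"text": "A", "_input_hash": 1, "_task_hash": 11, "answer": "reject"}
{"text": "B", "_input_hash": 2, "_task_hash": 20, "answer": "accept"}
"""

# Group the annotations by the hash of their input data.
by_input = defaultdict(list)
for line in exported.splitlines():
    eg = json.loads(line)
    by_input[eg["_input_hash"]].append(eg)

# Inputs that were annotated more than once, e.g. with different options or spans.
repeated = {h: egs for h, egs in by_input.items() if len(egs) > 1}
print(sorted(repeated))
```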

akshitasood63 · March 14, 2018, 12:40pm

I tried using this exclude function. For one recipe it doesn’t do the desired job: the annotations start from the beginning again in the next session.
In the other recipe, it shows no tasks available even though the database is empty.

In my workflow, the annotations from the first recipe are used again by the second recipe.

ines · March 14, 2018, 4:25pm

@akshitasood63 Moved your reply here, because this thread has more details on this. As I mentioned above, try setting the hashes of each task manually to ensure they’re available when Prodigy filters already annotated tasks from your stream. We’re also just working on Prodigy v1.4.1, which will include a fix for this.

ines · March 26, 2018, 2:51pm

Just released v1.4.1, which includes a fix for this issue. You should now be able to remove the set_hashes from your custom recipe and rely on Prodigy to handle this properly for you!

1danjordan · July 29, 2020, 10:10am

Hi @ines, I was using ner.make_gold.py as a template to write a custom NER correction recipe. It's essentially the same, with just a custom UI component added. The make_tasks function assigns new hashes with set_hashes. exclude doesn't appear to work with this recipe, and I wonder whether set_hashes might be the problem here. Would removing it make sense, given the fix in v1.4.1?

https://github.com/explosion/prodigy-recipes/blob/63057df56b1e63f4dc91df6da0469d71babfd561/ner/ner_make_gold.py#L10-L36

def make_tasks(nlp, stream, labels):
    """Add a 'spans' key to each example, with predicted entities."""
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True):
        task = copy.deepcopy(eg)
        spans = []
        for ent in doc.ents:
            # Continue if predicted entity is not selected in labels
            if labels and ent.label_ not in labels:
                continue
            # Create span dict for the predicted entity
            spans.append(
                {
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "text": ent.text,

(This file has been truncated.)

Another issue I have with this recipe is that it has a very long startup time, around 15 minutes. When running Prodigy with logging turned on, it appears that tokenizing the examples takes about 13 minutes. Is this to be expected? Totally fine if it is!

04:31:31: RECIPE: Loading recipe from file ../../chart_recipes.py
04:31:32: RECIPE: Calling recipe 'chart.ner.correct'
04:33:20: VALIDATE: Validating components returned by recipe
04:33:20: CONTROLLER: Initialising from recipe
04:33:20: VALIDATE: Creating validator for view ID 'blocks'
04:33:20: VALIDATE: Validating Prodigy and recipe config
04:33:20: DB: Initializing database SQLite
04:33:20: DB: Connecting to database SQLite
04:33:22: DB: Creating dataset '2020-07-29_04-33-20'
04:33:22: CONTROLLER: Initialising from recipe
04:33:22: CONTROLLER: Validating the first batch for session: None
04:33:22: PREPROCESS: Tokenizing examples
04:46:47: CORS: initialized with wildcard "*" CORS origins

Thanks,
Dan
