Post-labeling train/test split

I often find myself wanting to split train/test sets using information in the meta field of my examples. For instance, I’d like to know how well my models perform across sources of text (e.g. newswire vs. newspaper vs. wiki vs. blog posts), and I’d like to train on one or two types and then test on a third. This would really help with deciding whether I should get a greater variety of sources or just label more of the existing ones. Would this be complicated to add, and would other people benefit from it?
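For reference, my examples look roughly like this (simplified, with made-up source values):

{"text": "First example sentence.", "meta": {"source": "newswire"}, "answer": "accept"}
{"text": "Second example sentence.", "meta": {"source": "blog"}, "answer": "accept"}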

I could split them into different datasets, but I usually forget until it’s too late, and it’s also nice to be able to train on all of them at once.

Thanks, this definitely sounds useful!

In terms of implementation, I’m not sure what the best strategy is… I think the main complication is that users are technically able to structure their task however they like, so adding a command-line option to do complex, arbitrary filtering of nested properties seems kinda hard.

Doing it yourself within the batch-train recipes should be pretty easy, though. All you’d have to do is set --eval-split 0 and then filter the examples yourself based on some condition. This is maybe not the most elegant way, but I hope it gets the idea across:

# use the meta source to decide which examples go into the eval set
is_eval = lambda eg: eg['meta']['source'] == 'newswire'
evals = [eg for eg in examples if is_eval(eg)]
examples = [eg for eg in examples if not is_eval(eg)]

Another idea I just thought of: If you don’t mind adding the split sets you create to your database, you could also write a script or recipe that does this for you. So you wrap the batch_train function in a custom recipe, and before you execute it, you get the dataset from the database, split the examples based on your condition and add two new sets to the database. Haven’t tested this code yet and maybe it’s a bit too much… but then again, it’s a utility you write once and reuse forever.

import prodigy
from prodigy.recipes.ner import batch_train
from prodigy.components.db import connect

db = connect()  # connect to database

@prodigy.recipe('ner.batch-train-custom', ...)  # argument annotations here
def batch_train_custom(base_dataset, source_id, ...):  # other arguments here
    examples = db.get_dataset(base_dataset)  # get dataset
    evals = [eg for eg in examples if eg['meta']['source'] == source_id]
    examples = [eg for eg in examples if eg['meta']['source'] != source_id]

    # name the new dataset, e.g. set_without_newswire and set_eval_newswire
    dataset_name = '{}_without_{}'.format(base_dataset, source_id)
    evalset_name = '{}_eval_{}'.format(base_dataset, source_id)

    # add datasets and add examples to them
    db.add_dataset(dataset_name)
    db.add_examples(examples, datasets=[dataset_name])
    db.add_dataset(evalset_name)
    db.add_examples(evals, datasets=[evalset_name])

    # execute batch_train with new dataset and eval set as eval_id
    return batch_train(dataset_name, eval_id=evalset_name, ...)  # other arguments here

And then you could run it as:

prodigy ner.batch-train-custom my_dataset "newswire" ... # other arguments
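One caveat with this approach: every run adds the two derived sets to your database. If you’d rather tidy those up afterwards, the Database object also lets you list and drop sets, so (assuming the naming scheme and example names from above) a little cleanup snippet could look like this:

from prodigy.components.db import connect

db = connect()
# remove the derived sets created by the custom recipe above
for name in ('my_dataset_without_newswire', 'my_dataset_eval_newswire'):
    if name in db.datasets:  # db.datasets lists all dataset names
        db.drop_dataset(name)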

Btw, only semi-related, but I just noticed how annoying it is to duplicate the recipe arguments and argument annotations for recipes with this many arguments. So I was playing around with getting the arguments from the function, and the snippet below seems to work (plac adds the argument annotations to the .__annotations__ attribute, too, which is super nice). If this ends up working, maybe Prodigy could expose a helper function for it.

Edit: Okay, it kinda doesn’t work. But something like this might and should, haha.

# get the names of the positional arguments off the code object
argnames = batch_train.__code__.co_varnames[:batch_train.__code__.co_argcount]
# pair them with the default values (note: __defaults__ only covers the
# tail of the argument list, so this only lines up if every argument has
# a default)
args = dict(zip(argnames, batch_train.__defaults__))

@prodigy.recipe('custom', **batch_train.__annotations__)  # reuse plac annotations
def custom_recipe(**args):
    pass
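Another option that might be more reliable is inspect.signature from the standard library. The introspection part below definitely works; I haven’t tried feeding the result back into the recipe decorator, though, and get_recipe_args is just a name I made up:

import inspect

def get_recipe_args(func):
    # collect the plac-style annotations and default values from an
    # existing recipe function so they can be reused in a wrapper
    sig = inspect.signature(func)
    annotations = {name: param.annotation
                   for name, param in sig.parameters.items()
                   if param.annotation is not inspect.Parameter.empty}
    defaults = {name: param.default
                for name, param in sig.parameters.items()
                if param.default is not inspect.Parameter.empty}
    return annotations, defaults

annotations, defaults = get_recipe_args(batch_train)  # batch_train imported as above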

Thanks! I think a custom batch_train recipe is the way to go. I’ll give it a try tomorrow.

This worked great! I didn’t end up creating new dataset entries, to avoid cluttering my database, which meant I had to duplicate a lot of the batch-train code into batch-train-custom. I just define examples and evals as you do above and then go straight into the copied batch-train code. Once the repo for custom recipes is up, I’ll be happy to share.
