Thanks, this definitely sounds useful!
In terms of implementation, I’m not sure what the best strategy is… I think the main complication is that users are technically able to structure their task however they like, so adding a command-line option to do complex, arbitrary filtering of nested properties seems kinda hard.
Doing it yourself within the batch-train recipes should be pretty easy, though. All you'd have to do is set --eval-split 0 and then filter the examples yourself based on some condition. This is maybe not the most elegant way, but I hope it gets the idea across:
# condition that decides which examples go into the evaluation set
is_eval = lambda eg: eg['meta']['source'] == 'newswire'
evals = [eg for eg in examples if is_eval(eg)]
examples = [eg for eg in examples if not is_eval(eg)]
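(For context, this assumes each annotated example stores its source in its "meta" block, like the hypothetical example below. If your data nests that information differently, you'd just adjust the condition.)
# hypothetical example structure that the condition above assumes
example = {
    'text': 'Some sentence from a newswire article',
    'answer': 'accept',
    'meta': {'source': 'newswire'}
}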
Another idea I just thought of: if you don't mind adding the split sets you create to your database, you could also write a script or recipe that does this for you. So you wrap the batch_train function in a custom recipe, and before you execute it, you get the dataset from the database, split the examples based on your condition and add two new sets to the database. Haven't tested this code yet and maybe it's a bit too much… but then again, it's a utility you write once and reuse forever.
import prodigy
from prodigy.recipes.ner import batch_train
from prodigy.db import connect

db = connect()  # connect to database

@prodigy.recipe('ner.batch-train-custom', ...)  # argument annotations here
def batch_train_custom(base_dataset, source_id, ...):  # other arguments here
    examples = db.get_dataset(base_dataset)  # get dataset
    evals = [eg for eg in examples if eg['meta']['source'] == source_id]
    examples = [eg for eg in examples if eg['meta']['source'] != source_id]
    # name the new datasets, e.g. set_without_newswire and set_eval_newswire
    dataset_name = '{}_without_{}'.format(base_dataset, source_id)
    evalset_name = '{}_eval_{}'.format(base_dataset, source_id)
    # add the datasets and add the examples to them
    db.add_dataset(dataset_name)
    db.add_examples(examples, datasets=[dataset_name])
    db.add_dataset(evalset_name)
    db.add_examples(evals, datasets=[evalset_name])
    # execute batch_train with the new dataset, using the eval set as eval_id
    return batch_train(dataset_name, eval_id=evalset_name, ...)  # other arguments here
And then you could run it as:
prodigy ner.batch-train-custom my_dataset "newswire" ... # other arguments
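One caveat with the above: if you run the recipe twice with the same arguments, the examples would be added to the existing split sets again. If that's a concern, you could check whether the sets already exist before creating them. A rough sketch, assuming the datasets property on the database object lists the existing dataset names:
# only create and fill the split sets if they don't exist yet
if dataset_name not in db.datasets:
    db.add_dataset(dataset_name)
    db.add_examples(examples, datasets=[dataset_name])
if evalset_name not in db.datasets:
    db.add_dataset(evalset_name)
    db.add_examples(evals, datasets=[evalset_name])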
Btw, only semi-related, but I just noticed how annoying it is to duplicate the recipe arguments and argument annotations for recipes with this many arguments. So I was playing around with getting the arguments from the function, and the snippet below seems to work (Plac adds the argument annotations to the .__annotations__ attribute, too, which is super nice). So if this ends up working, maybe Prodigy could expose a helper function for this.
Edit: Okay, it kinda doesn’t work. But something like this might and should, haha.
# get the argument names and their default values from batch_train
# (note: __defaults__ only covers the trailing arguments that have defaults,
# so the zip can misalign if some arguments don't have one)
argnames = batch_train.__code__.co_varnames[:batch_train.__code__.co_argcount]
args = dict(zip(argnames, batch_train.__defaults__))

# reuse batch_train's plac-style annotations for the custom recipe
@prodigy.recipe('custom', **batch_train.__annotations__)
def custom_recipe(**args):
    pass
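Another thought: instead of poking at __code__, inspect.signature from the standard library might be a cleaner way to pull out the names and defaults, and it handles arguments without defaults correctly. Just a sketch, I haven't tried wiring it into the decorator either:
import inspect
from prodigy.recipes.ner import batch_train

# map each of batch_train's arguments to its default value (if it has one)
sig = inspect.signature(batch_train)
args = {name: param.default
        for name, param in sig.parameters.items()
        if param.default is not inspect.Parameter.empty}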