It would be nice to have a skip option to ignore the first N samples of the data. This is especially useful when we have to restart a recipe with the same dataset (after a code update, a crash, etc.).
Something like:
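import os

# Number of leading examples to skip, read from the SKIP environment variable
skip = int(os.getenv('SKIP', 0))
if skip > 0:
    # Consume the first `skip` examples (assumes `stream` is an iterator/generator)
    for idx, _ in enumerate(stream):
        if idx == skip - 1:
            print('Skipped {} examples from source'.format(skip))
            break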
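The same idea could also be wrapped in a small helper built on `itertools.islice`, which works for lists as well as generators. This is only a sketch of what I have in mind: `skip_examples` is a made-up name, not an existing function, and it assumes the stream is a plain Python iterable.

```python
from itertools import islice


def skip_examples(stream, n):
    """Return an iterator over `stream` with the first `n` items dropped.

    Hypothetical helper for illustration; `stream` can be any iterable.
    """
    if n <= 0:
        # Nothing to skip: hand back the stream unchanged (as an iterator)
        return iter(stream)
    # islice lazily discards the first n items and yields the rest
    return islice(stream, n, None)


# Possible usage inside a recipe:
# stream = skip_examples(stream, int(os.getenv('SKIP', 0)))
```

Because `islice` is lazy, the skipped examples are only consumed once the stream is actually iterated.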
Another option would be to dedupe the examples between DATASET and SOURCE so we don’t duplicate labels (as an option since you might want multiple labels per sample).
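A rough sketch of how the dedupe could work, assuming examples are plain dicts and that two examples count as "the same" when their `text` field matches. The names `example_key` and `dedupe_stream`, the `fields` parameter, and the `enabled` switch are all hypothetical and only here to illustrate the idea:

```python
import hashlib
import json


def example_key(example, fields=("text",)):
    """Build a stable hash from the fields that identify an example."""
    payload = json.dumps({f: example.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def dedupe_stream(stream, annotated, fields=("text",), enabled=True):
    """Yield only the examples from `stream` that are not in `annotated`.

    `annotated` stands in for the examples already saved in DATASET and
    `stream` for the examples coming from SOURCE (both assumptions).
    """
    if not enabled:
        # Opt out, e.g. when multiple labels per sample are wanted
        yield from stream
        return
    seen = {example_key(eg, fields) for eg in annotated}
    for eg in stream:
        if example_key(eg, fields) not in seen:
            yield eg
```

Comparing on selected fields rather than the whole dict means an already-labelled example still matches its unlabelled counterpart in the source.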
Thanks