Feature Request: Option to skip the first N samples

plusepsilon · April 27, 2018, 7:04pm

It would be nice to have a skip option that ignores the first N samples of the data. This is especially useful when we have to restart a recipe on the same dataset (after a code update, a crash, etc.).

Something like:

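import os

skip = int(os.getenv('SKIP', 0))
if skip > 0:
    stream = iter(stream)  # make sure skipped items are really consumed, even if stream is a list
    for idx, _ in enumerate(stream):
        # Stop once `skip` examples have been pulled off the stream
        if idx == skip - 1:
            print('Skipped {} examples from source'.format(skip))
            break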
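
The same idea can be wrapped in a small helper. This is only a sketch: `skip_examples` is a hypothetical name, not something Prodigy provides, and the example stream is a stand-in for whatever the recipe loads. It uses `itertools.islice` so that a source shorter than N is handled and the message reports how many examples were actually skipped.

```python
import os
from itertools import islice


def skip_examples(stream, n):
    """Return an iterator over `stream` with the first `n` items dropped.

    Hypothetical helper, not part of any library.
    """
    iterator = iter(stream)  # works for lists as well as generators
    # islice yields at most n items; counting them consumes them.
    skipped = sum(1 for _ in islice(iterator, n))
    print('Skipped {} examples from source'.format(skipped))
    return iterator


# Stand-in for the recipe's stream of tasks (assumed shape).
stream = ({'text': 'example {}'.format(i)} for i in range(10))
skip = int(os.getenv('SKIP', '0'))
if skip > 0:
    stream = skip_examples(stream, skip)
```

Running the recipe with `SKIP=500` set in the environment would then start at the 501st example, assuming the source is read in the same order every time.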

Another option would be to dedupe the examples between DATASET and SOURCE so we don’t duplicate labels. This should be optional, since you might want multiple labels per sample.
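
A rough sketch of what the dedupe could look like, in plain Python. Everything here is an assumption for illustration: `example_key` and `filter_seen` are hypothetical names, examples are assumed to be dicts with a `text` field, and matching is done on a hash of that text rather than on anything Prodigy stores.

```python
import hashlib


def example_key(example):
    """Stable key for an example, based on its text (assumed field name)."""
    return hashlib.sha1(example['text'].encode('utf8')).hexdigest()


def filter_seen(stream, annotated):
    """Yield examples from `stream` whose text is not already in `annotated`."""
    seen = {example_key(eg) for eg in annotated}
    for eg in stream:
        key = example_key(eg)
        if key not in seen:
            seen.add(key)  # also drops repeats within the source itself
            yield eg


# Stand-ins for the examples already in DATASET and the incoming SOURCE.
annotated = [{'text': 'a', 'answer': 'accept'}]
source = [{'text': 'a'}, {'text': 'b'}, {'text': 'b'}, {'text': 'c'}]
todo = list(filter_seen(source, annotated))
```

Here `todo` keeps only the examples with text `b` and `c`. Applying the filter only when asked for (for instance behind a flag) would keep the multiple-labels-per-sample case possible.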

Thanks
