Feature request: directories/archives of text files as a source format

In addition to jsonl, csv etc., have Prodigy accept a directory containing text files as input, one example per file. Equivalently, support a .tgz filetype.


Hi, can you please add support for gzip files? I usually store my datasets in that format, but I have to decompress them before I can use them with Prodigy.

Sure, do you mean gzipped jsonl, for the loading?

Yes, and plain text as well.

Merged these two threads, since they're related! Happy to add this feature in a future release – in the meantime, I've outlined a simple custom loader solution in this thread:

So instead of decompressing the file, you could also write a simple script that does this for you, loads the individual files and outputs the examples.
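For anyone landing here, a rough sketch of what such a script could look like. This is not Prodigy's built-in loader, just plain Python using the standard library. The function names (`examples_from_dir`, `examples_from_gzip_jsonl`, `examples_from_tgz`, `load_examples`) are made up for illustration, and it assumes UTF-8 text files ending in `.txt` and examples in the usual `{"text": ...}` shape.

```python
import gzip
import json
import sys
import tarfile
from pathlib import Path


def examples_from_dir(directory):
    """Yield one example per .txt file in a directory (sorted by name)."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield {"text": path.read_text(encoding="utf-8"), "meta": {"file": path.name}}


def examples_from_gzip_jsonl(path):
    """Yield examples from a gzipped JSONL file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def examples_from_tgz(path):
    """Yield one example per regular .txt member of a .tgz archive."""
    with tarfile.open(path, "r:gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".txt"):
                text = archive.extractfile(member).read().decode("utf-8")
                yield {"text": text, "meta": {"file": member.name}}


def load_examples(source):
    """Pick a reader based on whether source is a directory, .tgz or .gz."""
    source = Path(source)
    if source.is_dir():
        return examples_from_dir(source)
    if source.name.endswith((".tgz", ".tar.gz")):
        return examples_from_tgz(source)
    if source.name.endswith(".gz"):
        return examples_from_gzip_jsonl(source)
    raise ValueError(f"Don't know how to load {source}")


# Print one JSON object per line so the output can be piped elsewhere.
if __name__ == "__main__" and len(sys.argv) == 2:
    for example in load_examples(sys.argv[1]):
        print(json.dumps(example))
```

Saved as, say, `load_data.py`, this prints JSONL to standard output, so something like `python load_data.py ./my_texts | prodigy ner.manual my_dataset blank:en - --label PERSON` should work, assuming your Prodigy version reads from standard input when the source is set to `-`. The dataset name, path and label there are placeholders.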
