Problem
I’m experimenting with using Quilt to manage data. The problem is that if I have a text file example.txt, Quilt stores its contents in another file without the ‘.txt’ extension, such as example, and gives me the absolute path to that file:
/Users/hunan/quilt_data/example
I want to use this file in a custom Prodigy recipe to stream it, e.g.:
pgy custom.recept /Users/hunan/quilt_data/example [model] -F recipes/custom.py
But I get this error:
ValueError: No loader found for ''
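The empty string in the message is what I would expect if the loader were being picked from the file extension, since the Quilt path has none. That is only my assumption about how get_stream chooses a loader. The sketch below uses just the standard library to show that the extension of the Quilt path really is empty; extension_of is a made-up helper, not part of Prodigy:

```python
import os.path

# Path returned by Quilt (no '.txt' extension) and the original file name
quilt_path = '/Users/hunan/quilt_data/example'
original_name = 'example.txt'


def extension_of(path):
    """Hypothetical helper: return the extension without the dot ('' if none)."""
    return os.path.splitext(path)[1].lstrip('.')


quilt_ext = extension_of(quilt_path)
original_ext = extension_of(original_name)
print(repr(quilt_ext), repr(original_ext))  # prints: '' 'txt'
```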
Even though I explicitly set the loader to TXT, get_stream still complains. Here are the relevant lines in the recipe:
    from prodigy.components.loaders import TXT

    @recipe(
        'custom.recept',
        source=recipe_args['source'],
        api=recipe_args['api'],
        loader=recipe_args['loader'])
    def recept(source, api, loader=None):
        # If Quilt source, then we extract the path
        # and explicitly set the loader to TXT
        if is_quilt_path(source):  # source = 'quilt:/Users/hunan/quilt_data/example'
            source = get_abs_path(source)  # source = '/Users/hunan/quilt_data/example'
            loader = TXT
        stream = get_stream(source, api, loader, rehash=True, dedup=True)
But get_stream doesn’t treat source as a txt file even though loader=TXT.
Question
Is there a way of forcing get_stream to treat an arbitrary file as a txt file? I could write a custom get_stream for this use case, but then I’d have to implement rehash, dedup, etc.:
    def txt_loader(path):  # no matter what the extension of path is
        with open(path) as fp:
            for line in fp:
                yield {'text': line.strip()}

    def get_stream(path, loader=txt_loader):
        for obj in loader(path):
            yield obj
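Roughly, reimplementing the hashing and dedup part would mean something like the sketch below. It has no Prodigy dependency: the hash is plain MD5 over the text and the text_hash key is made up for illustration, so it is not how Prodigy computes or names its own hashes.

```python
import hashlib
import os
import tempfile


def txt_loader(path):
    """Yield one task per line, whatever the file extension is."""
    with open(path, encoding='utf8') as fp:
        for line in fp:
            yield {'text': line.strip()}


def add_hash(task):
    """Attach a content hash under a hypothetical 'text_hash' key."""
    digest = hashlib.md5(task['text'].encode('utf8')).hexdigest()
    return dict(task, text_hash=digest)


def get_stream(path, loader=txt_loader, dedup=True):
    """Load tasks, hash each one, and optionally skip repeated texts."""
    seen = set()
    for task in loader(path):
        task = add_hash(task)
        if dedup and task['text_hash'] in seen:
            continue
        seen.add(task['text_hash'])
        yield task


# Usage: a temporary file without an extension, like the one Quilt produces
with tempfile.NamedTemporaryFile('w', suffix='', delete=False, encoding='utf8') as tmp:
    tmp.write('first line\nsecond line\nfirst line\n')
tasks = list(get_stream(tmp.name))
os.remove(tmp.name)
print([t['text'] for t in tasks])  # prints: ['first line', 'second line']
```

This works for my case, but it duplicates logic that get_stream already provides, which is why I would rather reuse the built-in one.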
As always, grateful for your wonderful work. I appreciate any feedback on this.
(Tagging my colleagues: @plusepsilon @soumyagk.)