Using get_stream with TXT loader on paths without .txt extension

Problem

I’m experimenting with using Quilt to manage data. The problem is that when I add a text file example.txt, Quilt stores its contents in a file without the ‘.txt’ extension (here, example) and gives me the absolute path to that file:

  • /Users/hunan/quilt_data/example

I want to use this file in a custom Prodigy recipe to stream it, e.g.:

  • pgy custom.recept /Users/hunan/quilt_data/example [model] -F recipes/custom.py

But I get this error:

ValueError: No loader found for ''

Even though I explicitly set the loader to TXT, get_stream still complains. Here are the relevant lines in the recipe:


from prodigy.components.loaders import TXT

@recipe(
    source=recipe_args['source'],
    api=recipe_args['api'],
    loader=recipe_args['loader'])
def recept(source, api, loader):

    # If Quilt source, then we extract the path
    # and explicitly set the loader to TXT
    if is_quilt_path(source):           # source = 'quilt:/Users/hunan/quilt_data/example'
        source = get_abs_path(source)   # source = '/Users/hunan/quilt_data/example'
        loader = TXT

    stream = get_stream(source, api, loader, rehash=True, dedup=True)    

But get_stream doesn’t treat source as a txt file even though loader=TXT.


Question

Is there a way of forcing get_stream to treat an arbitrary file as a txt file? I could write a custom get_stream for this use case, but then I’d have to implement rehash, dedup, etc. myself:

def txt_loader(path):  # no matter what the extension of path is
    with open(path) as fp:
        for line in fp:
            yield {'text': line.strip()}

def get_stream(path, loader=txt_loader):
    for obj in loader(path):
        yield obj
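
To give an idea of what re-implementing dedup by hand would involve, here is a rough sketch. It hashes each task's text with hashlib and skips repeats. The names dedup_stream and the use of MD5 are my own assumptions for illustration; this is not how Prodigy hashes or deduplicates tasks internally.

```python
import hashlib


def txt_loader(path):
    """Yield one task per line, regardless of the file's extension."""
    with open(path, encoding="utf8") as fp:
        for line in fp:
            yield {"text": line.strip()}


def dedup_stream(stream):
    """Hypothetical helper: drop tasks whose text has already been seen.

    Uses an MD5 digest of the text as the key (an assumption for this
    sketch, not Prodigy's own hashing scheme).
    """
    seen = set()
    for task in stream:
        key = hashlib.md5(task["text"].encode("utf8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield task
```

Usage would be something like list(dedup_stream(txt_loader(path))). It works, but it only covers one of the things get_stream already does, which is why I'd rather not go down this road.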

As always, grateful for your wonderful work. I appreciate any feedback on this.

(Tagging my colleagues: @plusepsilon @soumyagk.)

Resolved! @soumyagk suggested setting loader to the string 'txt' instead of the TXT loader imported from prodigy.components.loaders, and that worked wonderfully.

Solution

# instead of TXT (from prodigy.components.loaders)
loader = 'txt'  
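
My guess at why this works, going by the error message: the loader seems to be looked up by name, with the file extension as a fallback. The toy sketch below shows that idea. LOADERS and resolve_loader are hypothetical names I made up; this is not Prodigy's actual code, just a model that reproduces the behaviour I saw.

```python
import os


def txt_loader(path):
    """Yield one task per line of a plain text file."""
    with open(path, encoding="utf8") as fp:
        for line in fp:
            yield {"text": line.strip()}


# Hypothetical registry mapping loader names to loader functions.
LOADERS = {"txt": txt_loader}


def resolve_loader(source, loader=None):
    """Pick a loader by string name, else fall back to the file extension."""
    if isinstance(loader, str) and loader in LOADERS:
        return LOADERS[loader]
    # Anything that is not a known string name is ignored here,
    # so the extension of the source path decides.
    ext = os.path.splitext(source)[1].lstrip(".")
    if ext in LOADERS:
        return LOADERS[ext]
    raise ValueError("No loader found for '{}'".format(ext))
```

In this model, passing the string 'txt' resolves fine even for a path with no extension, while passing a loader object falls through to the extension lookup and fails on the empty extension.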

Thanks for the great post and for updating with your solution! :+1: Glad it’s all working now!