Dataset from MySQL instead of JSON in 2022

How can I feed a textcat annotation project from MySQL or a REST API instead of a static JSON file?


Hi! You can always write an entirely custom loader that fetches data from your MySQL database or REST API and yields dictionaries in Prodigy's format (e.g. {"text": "..."}). You can either integrate it via a custom recipe, or make it a separate script that writes out the dictionaries, and then pipe its output forward to any recipe on the command line. See here for examples:

Awesome, @ines!
Is there an example of passing a dynamic stream to the Images loader?
The image.manual recipe has this line:
stream = Images(source)
But if source is the result of a query like "select x,z from table limit 0,50", how can I trigger a new load of the source?


Are you using a custom recipe? In that case, you can also just modify the recipe itself and add a custom stream generator that loads from your database. Of course, the specific implementation will depend on what your database query returns, but it'll roughly look like this:

def custom_stream():
    data = make_your_database_query_here()  # query your db
    for image_url in data:
        yield {"image": image_url}

# in your recipe
stream = custom_stream()
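
To make the placeholder query above more concrete, here is a minimal sketch of one way it could look. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the images table, its id and url columns and the example URLs are all assumptions. A MySQL driver that follows the Python DB-API should offer the same cursor/execute/fetchall calls, although the connection setup and the parameter placeholder style will differ.

```python
import sqlite3


def make_your_database_query_here(conn):
    """Return all image URLs from a hypothetical `images` table."""
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM images ORDER BY id")
    return [row[0] for row in cursor.fetchall()]


def custom_stream(conn):
    """Yield one Prodigy-style task dict per image URL."""
    for image_url in make_your_database_query_here(conn):
        yield {"image": image_url}


# Demo setup: an in-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO images (url) VALUES (?)",
    [("https://example.com/a.jpg",), ("https://example.com/b.jpg",)],
)

tasks = list(custom_stream(conn))
```

In a recipe you would pass the generator itself (stream = custom_stream(conn)) rather than a list, so tasks are only produced as they are requested.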

If you're only loading a few results at a time or you want to make requests to a paginated API, you can also do something like this and keep incrementing the page/count/whatever until no data is available anymore:

def paginated_stream():
    page = 0
    while True:
        data = make_your_database_query_here(page)
        if not data:  # stop once a page comes back empty
            break
        for image_url in data:
            yield {"image": image_url}
        page += 1
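
For a query like the "limit 0,50" one from the question, the paging idea could be sketched roughly as follows. Again, sqlite3 is only a stand-in for MySQL, and the images table, the fetch_page helper and the page size of 50 are assumptions for illustration. LIMIT with OFFSET is used here because both SQLite and MySQL accept that form.

```python
import sqlite3

PAGE_SIZE = 50  # assumed page size, matching "limit 0,50" in the question


def fetch_page(conn, page, page_size=PAGE_SIZE):
    """Fetch one page of image URLs from a hypothetical `images` table."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT url FROM images ORDER BY id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return [row[0] for row in cursor.fetchall()]


def paginated_stream(conn, page_size=PAGE_SIZE):
    """Yield tasks page by page until a page comes back empty."""
    page = 0
    while True:
        urls = fetch_page(conn, page, page_size)
        if not urls:
            break
        for image_url in urls:
            yield {"image": image_url}
        page += 1


# Demo setup: 120 rows, so the stream spans three pages (50 + 50 + 20).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO images (url) VALUES (?)",
    [(f"https://example.com/{i}.jpg",) for i in range(120)],
)

tasks = list(paginated_stream(conn))
```

Because the generator only runs the next query once the previous page has been consumed, each new page is loaded on demand while annotation is in progress.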

If you're using the built-in image.manual recipe, it will be able to also read from standard input because it uses Prodigy's get_stream helper instead of Images to load the stream. So you can have a script that loads your data and writes it to standard output:

import json

data = make_your_database_query_here()
for image_url in data:
    # print one JSON object per line so the recipe can parse it
    print(json.dumps({"image": image_url}))

You can then pipe the output forward to the recipe and set the source to - to read from stdin:

python your_script.py | prodigy image.manual dataset - --label FOO,BAR
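
One detail worth spelling out: when reading from stdin, the recipe expects one JSON object per line, so the script should serialize each task with json.dumps rather than printing the Python dict directly (which would produce single-quoted output that is not valid JSON). A minimal sketch of such a script, with hypothetical helper names and placeholder URLs in place of a real database query:

```python
import json
import sys


def jsonl_lines(image_urls):
    """Yield one JSON-encoded task per image URL."""
    for image_url in image_urls:
        yield json.dumps({"image": image_url})


def main(image_urls, out=sys.stdout):
    """Write the tasks to the given stream, one JSON object per line."""
    for line in jsonl_lines(image_urls):
        out.write(line + "\n")


if __name__ == "__main__":
    # Placeholder data; in practice this would come from your database query.
    main(["https://example.com/a.jpg", "https://example.com/b.jpg"])
```

Saved under a hypothetical name such as load_images.py, this is the kind of script whose output gets piped into the recipe.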

Awesome, you always explain all the details.