Custom Stream / API

Hi,

I have a generator that will generate examples for me for ner.make-gold etc. I wrapped the make-gold and teach recipes and everything seems to work great, with the exception that the generator doesn't seem to yield one example at a time: it loads and consumes quite a few examples before I even get to annotate the first one (if it isn't totally exhausted by then). I was hoping the generator would only be executed at runtime for each example, because the API it's based on is rate limited.

Is there any way to do that? To not have the generator consumed entirely before the first page loads in Prodigy? How are the bundled API loaders built, like the NYT one etc.? Surely they can't load everything first and then go from there.

Thanks!

Comron

Yes, that's definitely possible and one of the big advantages of using generators for the streams! When you start the server, Prodigy will get the first batch of tasks from the stream (by default, 10 tasks).
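
To illustrate the batching behaviour described above, here's a minimal sketch. It is not Prodigy's internal code: `stream`, `get_batch` and the `produced` list are hypothetical names used only for this example. It shows that pulling one batch of 10 from a generator only runs the generator 10 times.

```python
from itertools import islice

produced = []  # records every task the generator has actually created


def stream():
    """Hypothetical endless stream of tasks."""
    n = 0
    while True:
        task = {"text": "example {}".format(n)}
        produced.append(task)
        yield task
        n += 1


def get_batch(tasks, batch_size=10):
    """Pull at most batch_size items from the stream and no more."""
    return list(islice(tasks, batch_size))


tasks = stream()
first_batch = get_batch(tasks)
# Both numbers are the same: only the first batch has been generated.
print(len(first_batch), len(produced))
```

If `len(produced)` came out much larger than the batch size in a setup like this, something between the generator and the consumer would be reading ahead.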

If you're reading data from a file, one important requirement is that you need to be able to load it in a way that doesn't require the whole file to be consumed first. For example, to read in JSON, you need to parse the entire file upfront, which is bad – JSONL or plain text on the other hand can be read line by line.
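
As a rough sketch of the line-by-line approach for JSONL (the function name `jsonl_stream` and the file name in the usage comment are made up for this example, and it assumes one JSON object per line):

```python
import json


def jsonl_stream(path):
    """Yield one task per line without reading the whole file upfront."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage (hypothetical file name): only the lines that are actually
# requested get read and parsed.
# stream = jsonl_stream("my_data.jsonl")
# first_task = next(stream)
```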

If you're loading from an API, you usually want to start by fetching the first page and only fetch the next one once the current page is exhausted. Here's a pseudocode example, which is very similar to how Prodigy's API loaders do this (some of the specifics vary depending on the API):

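import requests

def api_loader():
    page = 0
    while True:
        # Each page is only requested once the consumer asks for more tasks
        r = requests.get('http://your-api.com', params={'page': page})
        r.raise_for_status()
        response = r.json()
        if not response:
            # Assumes the API returns an empty list when there are no more pages
            break
        for item in response:
            task = {'text': item['text']}
            yield task
        page += 1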

If your generator is still consumed entirely (or it seems like too much of it is consumed), here are some possible explanations and tips for debugging:

  • Maybe your code accidentally consumes the generator by converting it to a list (or accidentally using a list comprehension or returning something instead of yielding it). It's an easy mistake to make and has happened to me before. If you're using generator helpers like itertoolz, some of those might also consume your stream, so this might be worth checking, too.
  • If you're using an active learning powered recipe, which selects examples from the stream, it's possible that Prodigy just needs to consume more examples in order to fill up the first batch. Sometimes, this means that the data doesn't have enough relevant examples and/or that the model just isn't predicting the label very often (if you're training a new category). As a solution, you can try using different data, or adding match patterns to select more candidates.
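
For the first point, one way to check where the stream gets consumed is to wrap it in a small counter. This is only a debugging sketch: `CountingStream` and `make_tasks` are hypothetical helpers, not part of Prodigy.

```python
class CountingStream:
    """Iterator wrapper that counts how many items were pulled from a stream."""

    def __init__(self, stream):
        self.stream = iter(stream)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self.stream)  # StopIteration propagates when exhausted
        self.count += 1
        return item


def make_tasks(n=1000):
    """Hypothetical stand-in for a real loader."""
    for i in range(n):
        yield {"text": "example {}".format(i)}


# Eager: a list comprehension pulls every item straight away.
eager_source = CountingStream(make_tasks())
eager = [task for task in eager_source]
print(eager_source.count)  # 1000

# Lazy: a generator expression pulls nothing until an item is requested.
lazy_source = CountingStream(make_tasks())
lazy = (task for task in lazy_source)
first = next(lazy)
print(lazy_source.count)  # 1
```

Wrapping your own generator like this before passing it on, and printing `count` at different points, should show which step pulls more items than expected.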