Prodigy stream buffering behavior

I am using a custom recipe that applies two models to records before yielding them in the preprocess() function. The app seems to buffer >30 records (2.5 minutes) before the Uvicorn server starts serving.

The recipe converts a zeroshot GPT3.5 model into a fewshot one, so I don't necessarily need a ton of examples but do need to customize them for a lot of data sets. Is there some way to drop the target buffer queue depth so I can bring my boot latency down? IMO it would be fine to start the app with a single record buffer.

Verbose logs were:

18:31:26: INIT: Setting all logging levels to 10
18:31:26: CLI: Importing file /opt/extraxion/prodigy_utils/
18:31:26: CLI: Importing file /opt/extraxion/spacy_utils/
18:31:26: RECIPE: Calling recipe 'rel.assisted'
ℹ RECIPE: Starting recipe rel.assisted
{'dataset': 'dataset_out', 'ner_model': '/opt/models/ner', 'rel_model':
'/opt/models/rel', 'source': 'records.jsonl', 'loader': None, 'exclude': None,
'wrap': False, 'hide_arrow_heads': False, 'ner_label_component': 'llm',
'rel_label_component': 'llm', 'label': None, 'span_label': None, 'threshold':
18:31:26: LOADER: Using file extension 'jsonl' to find loader

18:31:26: LOADER: Loading stream from jsonl
18:31:26: LOADER: Rehashing stream
18:31:27: CONFIG: Using config from global prodigy.json

18:31:27: CONFIG: Using config from working dir

18:31:27: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'lang': 'en', 'labels': ['A_LABEL', 'A_COOL'], 'relations_span_labels': ['A','COOL','LABEL'], 'exclude_by': 'input', 'wrap_relations': False, 'custom_theme': {'cardMaxWidth': '90%'}, 'hide_relation_arrow': False, 'auto_count_stream': True, 'dataset': 'dataset_out', 'recipe_name': 'rel.assisted', 'batch_size': 1, 'history_size': 0, 'buttons': ['accept', 'ignore', 'undo'], 'feed_overlap': False, 'swipe': False, 'swipe_gestures': {'left': 'accept', 'right': 'ignore'}, 'validate': False, 'instant_submit': True}, 'dataset': 'dataset_out', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0xffff7e086e60>, 'self': <prodigy.core.Controller object at 0xffff7e06abf0>, 'stream': <generator object assisted.<locals>.preprocess_stream at 0xffff7dc5b530>, 'update': <function assisted.<locals>.make_update at 0xffff7e07e710>, 'validate_answer': <function assisted.<locals>.validate_answer at 0xffff7e07f1c0>, 'view_id': 'relations'}

18:31:27: CONFIG: Using config from global prodigy.json

18:31:27: CONFIG: Using config from working dir

18:31:27: DB: Initializing database SQLite
18:31:27: DB: Connecting to database SQLite
18:31:27: DB: Creating dataset 'dataset_out'
Added dataset dataset_out to database SQLite.
18:31:27: DB: Creating dataset '2023-11-02_18-31-27'
{'created': datetime.datetime(2023, 11, 2, 18, 31, 27)}

18:31:27: FEED: Initializing from controller
{'auto_count_stream': True, 'batch_size': 1, 'dataset': 'dataset_out', 'db': <prodigy.components.db.Database object at 0xffff7df900d0>, 'exclude': ['dataset_out'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0xffff7d85ef20>, 'stream': <generator object assisted.<locals>.preprocess_stream at 0xffff7dc5b530>, 'target_total_annotated': None, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_by_session': Counter(), 'validator': None, 'view_id': 'relations'}

before the for
18:31:27: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0xffff7e070cc0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0xffff7f4dfd30>>, 'warn_threshold': 0.4}

18:31:27: FILTER: Filtering out empty examples for key 'text'
sample 0
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
sample 7
sample 8
sample 9
sample 10
sample 11
sample 12
sample 13
sample 14
sample 15
sample 16
sample 17
sample 18
sample 19
sample 20
sample 21
sample 22
sample 23
sample 24
sample 25
sample 26
sample 27
sample 28
sample 29
sample 30
sample 31
sample 32
sample 33
sample 34
18:33:37: STREAM: Counting iterator exceeded timeout of 10 seconds after 3 tasks
sample 35
18:33:37: CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at ...
Open the app in your browser and start annotating!

INFO:     Started server process [15]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on (Press CTRL+C to quit)
sample 36
sample 37
sample 38
18:33:50: STREAM: Counting iterator exceeded timeout of 10 seconds after 7 tasks

"before the for" is a print statement from just before the preprocess() generator's for loop
f"sample {i}" is a print statement just inside the generator's for loop

So, the stream yields >30 documents before prodigy starts the Uvicorn server (I've seen 35 to 46), yields a few more after the server starts, and then it stops and waits for me to label some before it does more.
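
To make the difference concrete, here is a minimal sketch in plain Python of a lazy generator versus a consumer that prefetches a fixed number of items. It is not Prodigy's internal code: `prefetch` and its `depth` parameter are hypothetical names used only to illustrate the effect seen in the logs.

```python
from itertools import islice


def preprocess(records, log):
    # Stand-in for the recipe's preprocess() generator. `log` records each
    # item at the moment it is actually computed.
    for i, record in enumerate(records):
        log.append(f"sample {i}")
        yield record


def prefetch(stream, depth):
    # Hypothetical consumer that eagerly pulls `depth` items before handing
    # anything out. This is an assumption for illustration, not Prodigy's API.
    buffer = list(islice(stream, depth))
    yield from buffer
    yield from stream


records = [f"doc {n}" for n in range(100)]

# Lazy consumption: asking for one item computes exactly one item.
lazy_log = []
first_lazy = next(preprocess(records, lazy_log))

# Prefetching consumption: asking for one item computes `depth` items first.
eager_log = []
first_eager = next(prefetch(preprocess(records, eager_log), depth=30))

print(len(lazy_log), len(eager_log))
```

With a slow model call inside the loop, the time to the first record scales with the prefetch depth rather than with the single record that is actually needed.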

From searching through the forum for similar cases:

  • I observe this regardless of batch_size
  • I observe this regardless of validate or feed_overlap
  • I observe this with both a 100 record input file and a 5000+ record input file.
  • I don't observe this as a problem with a 3 record input file (of course)
  • prodigy version 1.11.14
  • pydantic version 1.10.13

Thanks for any help!

It might help to see the custom recipe code. Are you using spacy-llm? In general, when dealing with external models, I really prefer to download batches ahead of time as much as possible. Timeouts happen a lot, and they are hard to control.

Is that an option for you? I could also dive deeper into the code if that's preferable, but figured I'd check.

Yes, I am using spacy-llm. In order to deal with timeouts / quota exceeded, I basically set super forgiving backoff / retry behavior. I am using prodigy more for fewshot example tuning than for the more traditional "labeling a thousand examples" use case, which is why the startup behavior matters to me.

The recipe is 430 lines long so I could probably DM it but it's too much for a forum. The gist is that I apply a NER model to label entities and then a REL model to label relations, both backed by spacy-llm. It renders similarly to rel.manual, but with the entities/relations pre-labeled by the zeroshot models. As results come in, I add them to the models as examples in make_update(), which turns them into fewshot. Prompt size is limited, so the first few examples (10?) are the most valuable.

If stream prebuffering was less aggressive, theoretically I could see the impact of the labeled examples very quickly. As it is, I spend 2+ minutes waiting for the queries to load and then churn through the 30+ prebuffered zeroshot results before the first fewshot model results are seen. This is also why building a mechanism to buffer zeroshot results out of band isn't really a useful solution for me. If you imagine a batch_size of ~1, I also wouldn't really have to worry about OpenAI quotas due to the latency added by the prodigy HIL design.
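
A rough sketch of why the buffer depth matters for this feedback loop, in plain Python with no Prodigy or spacy-llm calls. The names `examples`, `predict_stream`, `annotate`, and `buffer_depth` are hypothetical, and `make_update` here only mirrors the role of the update callback in my recipe.

```python
from itertools import islice

examples = []  # few-shot examples collected from annotators (hypothetical store)


def make_update(answers):
    # Accepted answers become few-shot examples for later predictions.
    examples.extend(a for a in answers if a.get("answer") == "accept")


def predict_stream(texts):
    for text in texts:
        # The example count is read at prediction time, so anything already
        # buffered keeps the count it was computed with.
        yield {"text": text, "n_examples_used": len(examples)}


def annotate(texts, buffer_depth):
    # Simulates an annotator accepting every task, with the stream being
    # pulled `buffer_depth` tasks at a time.
    examples.clear()
    stream = predict_stream(texts)
    seen = []
    while True:
        batch = list(islice(stream, buffer_depth))
        if not batch:
            break
        for task in batch:
            seen.append(task["n_examples_used"])
            make_update([{**task, "answer": "accept"}])
    return seen


texts = [f"text {n}" for n in range(6)]
shallow = annotate(texts, buffer_depth=1)
deep = annotate(texts, buffer_depth=3)
print(shallow, deep)
```

With a depth of 1, every prediction benefits from all earlier answers. With a deeper buffer, the annotator works through stale zeroshot predictions before any fewshot result appears.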

The gist is that I apply a NER model to label entities and then a REL model to label relations, both backed by spacy-llm

Are these both defined in the same config file? Or are you dealing with two separate nlp pipelines here?

One thing just to be sure, are you using the cache in spacy-llm?

Once the LLM examples are in the cache they shouldn't really cause a delay. If they still do, I'd love to hear about it, because that might imply a bug on our end.

Also, if you have a minimum reproducible example then I can try to run it locally and see if I spot anything.

They are defined in separate config files. They are applied serially as they would be if they were in the same config file, except that we don't apply the REL model if there are <2 entities.
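
As a simplified sketch of that control flow (not the actual recipe code): `annotate_record`, `toy_ner`, and `toy_rel` are hypothetical names, and the toy functions stand in for the two spacy-llm pipelines.

```python
def annotate_record(text, ner_fn, rel_fn):
    # Run NER first, then only call the REL model when there are at least
    # two entities that could be related to each other.
    entities = ner_fn(text)
    relations = rel_fn(text, entities) if len(entities) >= 2 else []
    return {"text": text, "entities": entities, "relations": relations}


rel_calls = []  # tracks how often the (expensive) REL step actually runs


def toy_ner(text):
    # Toy stand-in: treat capitalized words as entities.
    return [word for word in text.split() if word.istitle()]


def toy_rel(text, entities):
    # Toy stand-in: relate the first entity to every other entity.
    rel_calls.append(text)
    return [(entities[0], "related_to", other) for other in entities[1:]]


with_relations = annotate_record("Alice met Bob", toy_ner, toy_rel)
without_relations = annotate_record("nothing to see here", toy_ner, toy_rel)
print(with_relations["relations"], len(rel_calls))
```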

I have put together what is probably an MRE, but OpenAI is down right now so I'll post it a little later after I have tested it and verified that it still behaves as expected.

Looks like I cannot directly upload the MVE here due to extension limits, but here is a google drive link.

It includes a description of how I imagined it being used and of the problem (buffering more examples than batch_size). I decided that including two models was unnecessary for the MVE. Really, even spacy-llm is unnecessary to show the undesirable buffering behavior, but using it gives some appreciation for the latency.

I have requested access. Let me know once I'm able to download the code.

Should be able to! Sorry, I thought it was public already

Sorry for the delay, the flu got to me which is why I'm only responding now.

When I run your code, it takes about 12s for me to see this output.

dotenv run -- python -m prodigy ner.example xxx model.cfg records.jsonl -F
ℹ RECIPE: Starting recipe ner.example
{'dataset': 'xxx', 'ner_model': 'model.cfg', 'source': 'records.jsonl',
'loader': None, 'exclude': None, 'wrap': False, 'hide_arrow_heads': False,
'ner_label_component': 'llm', 'span_label': None}
ℹ ======== Starting the for Loop ========
ℹ Record 0 yielded
ℹ Record 1 yielded
ℹ Record 2 yielded
ℹ Record 3 yielded
ℹ Record 4 yielded
ℹ Record 5 yielded
ℹ Record 6 yielded
ℹ Record 7 yielded
ℹ Record 8 yielded
ℹ Record 9 yielded

That's still not fast, but nowhere near the two-minute mark. Getting to 16 takes another 8s or so and the server boots just fine on my end. However, in your current setup it would take just as long again if I were to restart the server. So I've made a change to your config.cfg file by adding a cache at the end.

[paths]
examples = null

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["FOOD", "ANIMAL"]
description = "You are trying to identify food or animal entities in the text."

[components.llm.task.label_definitions]
FOOD = "Something edible without further processing. e.g. bacon, tapioca, nut, cake"
ANIMAL = "A living creature, but not a plant. e.g. fish, dog, parrot, pig"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v2"
max_tries = 8
max_request_time = 120

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

Reloading the server now is much faster!

So I figured I'd pause here and check: if you add such a cache, does your situation become manageable again?

Small extra question: is there a reason why you're running Prodigy v1.11.14 here?

Sorry to hear about your illness and no problem on the delay... we're in that sick season again!

I'm pretty surprised that the behavior doesn't replicate for you. Perhaps my OpenAI API key is pushing up against the API limits or their recent DDOS behavior started a bit earlier than they claimed.

Using caching doesn't work for me for a few reasons:

  1. Part of the reason I don't want to stream many examples is not latency-related. I'm improving the model (zeroshot to fewshot) when users label examples. If I cache zero-shot results, they're still zero-shot results.
  2. I'm running this in Cloud Run in a container, so making the cache actually work with the infrastructure would be troublesome.

I avoided upgrading past 1.11.14 because 1.12.0 (I think) compiled a lot of functions and inspect.getsource() was my main way to discover the prodigy API. It's possible that I could reach a similar understanding by carefully interpreting the docs and testing stuff, but just looking at source is way way faster. However, I have gotten over that hurdle and only had to change a few small things between 1.11.14 and 1.14.9.

Preliminarily, the behavior described in this thread does seem different on 1.14.9 than it did on 1.11.14. It seems to buffer approximately one batch at a time now, which means I have much better control over it and can just make batch size smaller to control the queue size. I will report back if this solution is insufficient, but I believe it will be enough!
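
For anyone landing here later, this is roughly where the batch size is set in a custom recipe. It is a sketch based on the controller arguments in the logs above: `recipe_components` is a hypothetical helper, and the real recipe returns more keys (such as the update callback).

```python
def recipe_components(dataset, stream, batch_size=1):
    # Sketch of the dictionary a custom recipe returns. The keys mirror the
    # controller arguments shown in the verbose logs earlier in this thread.
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "relations",
        "config": {
            # On 1.14.9 this appears to bound how far ahead the stream is read.
            "batch_size": batch_size,
            # Send each answer back immediately instead of waiting for a batch.
            "instant_submit": True,
        },
    }


components = recipe_components("dataset_out", iter([{"text": "example"}]))
print(components["config"])
```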


Happy to hear it! Let me know if you have any extra questions.
