I'm using a custom recipe that applies two models to each record in its `preprocess()` generator before yielding it. The app appears to buffer more than 30 records (about 2.5 minutes of model time) before the Uvicorn server starts serving.
The recipe converts a zero-shot GPT-3.5 model into a few-shot one, so I don't need a huge number of examples, but I do need to customize them for a lot of datasets. Is there some way to reduce the target buffer/queue depth so I can bring my boot latency down? A single-record buffer at startup would be fine for my use case.
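For reference, here is the relevant shape of my `preprocess()` generator, simplified to be self-contained (`apply_ner` and `apply_rel` are stand-ins for the two real, slow model calls):

```python
def apply_ner(eg):
    # stand-in for the first (slow) model call per record
    return {**eg, "spans": []}

def apply_rel(eg):
    # stand-in for the second (slow) model call per record
    return {**eg, "relations": []}

def preprocess(stream):
    print("before the for")
    for i, eg in enumerate(stream):
        print(f"sample {i}")
        # each yielded record has been through both models
        yield apply_rel(apply_ner(eg))
```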
Verbose logs were:
18:31:26: INIT: Setting all logging levels to 10
18:31:26: CLI: Importing file /opt/extraxion/prodigy_utils/assisted_recipe.py
18:31:26: CLI: Importing file /opt/extraxion/spacy_utils/axion_tasks.py
18:31:26: RECIPE: Calling recipe 'rel.assisted'
ℹ RECIPE: Starting recipe rel.assisted
{'dataset': 'dataset_out', 'ner_model': '/opt/models/ner', 'rel_model':
'/opt/models/rel', 'source': 'records.jsonl', 'loader': None, 'exclude': None,
'wrap': False, 'hide_arrow_heads': False, 'ner_label_component': 'llm',
'rel_label_component': 'llm', 'label': None, 'span_label': None, 'threshold':
None}
18:31:26: LOADER: Using file extension 'jsonl' to find loader
records.jsonl
18:31:26: LOADER: Loading stream from jsonl
18:31:26: LOADER: Rehashing stream
18:31:27: CONFIG: Using config from global prodigy.json
/opt/prodigy.json
18:31:27: CONFIG: Using config from working dir
/opt/prodigy.json
18:31:27: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'lang': 'en', 'labels': ['A_LABEL', 'A_COOL'], 'relations_span_labels': ['A','COOL','LABEL'], 'exclude_by': 'input', 'wrap_relations': False, 'custom_theme': {'cardMaxWidth': '90%'}, 'hide_relation_arrow': False, 'auto_count_stream': True, 'dataset': 'dataset_out', 'recipe_name': 'rel.assisted', 'batch_size': 1, 'history_size': 0, 'buttons': ['accept', 'ignore', 'undo'], 'feed_overlap': False, 'swipe': False, 'swipe_gestures': {'left': 'accept', 'right': 'ignore'}, 'validate': False, 'instant_submit': True}, 'dataset': 'dataset_out', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0xffff7e086e60>, 'self': <prodigy.core.Controller object at 0xffff7e06abf0>, 'stream': <generator object assisted.<locals>.preprocess_stream at 0xffff7dc5b530>, 'update': <function assisted.<locals>.make_update at 0xffff7e07e710>, 'validate_answer': <function assisted.<locals>.validate_answer at 0xffff7e07f1c0>, 'view_id': 'relations'}
18:31:27: CONFIG: Using config from global prodigy.json
/opt/prodigy.json
18:31:27: CONFIG: Using config from working dir
/opt/prodigy.json
18:31:27: DB: Initializing database SQLite
18:31:27: DB: Connecting to database SQLite
18:31:27: DB: Creating dataset 'dataset_out'
Added dataset dataset_out to database SQLite.
18:31:27: DB: Creating dataset '2023-11-02_18-31-27'
{'created': datetime.datetime(2023, 11, 2, 18, 31, 27)}
18:31:27: FEED: Initializing from controller
{'auto_count_stream': True, 'batch_size': 1, 'dataset': 'dataset_out', 'db': <prodigy.components.db.Database object at 0xffff7df900d0>, 'exclude': ['dataset_out'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0xffff7d85ef20>, 'stream': <generator object assisted.<locals>.preprocess_stream at 0xffff7dc5b530>, 'target_total_annotated': None, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_by_session': Counter(), 'validator': None, 'view_id': 'relations'}
before the for
18:31:27: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0xffff7e070cc0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0xffff7f4dfd30>>, 'warn_threshold': 0.4}
18:31:27: FILTER: Filtering out empty examples for key 'text'
sample 0
sample 1
sample 2
sample 3
sample 4
sample 5
sample 6
sample 7
sample 8
sample 9
sample 10
sample 11
sample 12
sample 13
sample 14
sample 15
sample 16
sample 17
sample 18
sample 19
sample 20
sample 21
sample 22
sample 23
sample 24
sample 25
sample 26
sample 27
sample 28
sample 29
sample 30
sample 31
sample 32
sample 33
sample 34
18:33:37: STREAM: Counting iterator exceeded timeout of 10 seconds after 3 tasks
sample 35
18:33:37: CORS: initialized with wildcard "*" CORS origins
✨ Starting the web server at http://0.0.0.0:8804 ...
Open the app in your browser and start annotating!
INFO: Started server process [15]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8804 (Press CTRL+C to quit)
sample 36
sample 37
sample 38
18:33:50: STREAM: Counting iterator exceeded timeout of 10 seconds after 7 tasks
`"before the for"` is a print statement from just before the `preprocess()` generator's for loop, and `f"sample {i}"` is a print statement just inside that loop.
So, the stream yields >30 documents before Prodigy starts the Uvicorn server (I've seen 35 to 46), yields a few more after the server starts, and then it stops and waits for me to label some before it processes more.
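To illustrate what I think is happening, here is a minimal, Prodigy-free sketch. My assumption (which I can't confirm from the logs alone) is that something upstream eagerly pulls a fixed number of items from the generator before serving, e.g. to count the stream or pre-fill a queue; if so, all the "sample" prints would fire up front, exactly matching the pattern above:

```python
import itertools

def source():
    # stand-in for records.jsonl
    for i in range(100):
        yield {"text": f"doc {i}"}

def preprocess(stream):
    print("before the for")
    for i, eg in enumerate(stream):
        print(f"sample {i}")
        yield eg

stream = preprocess(source())
# An eager consumer pulling N items up front makes every per-record
# model call happen before anything else can proceed.
buffered = list(itertools.islice(stream, 35))
```

In this sketch the 35 "sample" prints all appear before `buffered` is available, which is the same shape as my boot sequence.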
From searching through the forum for similar cases:
- I observe this regardless of `batch_size`.
- I observe this regardless of `validate` or `feed_overlap`.
- I observe this with both a 100-record input file and a 5000+ record input file.
- I don't observe this as a problem with a 3-record input file (of course).
`prodigy` version 1.11.14, `pydantic` version 1.10.13.
Thanks for any help!