llm.fetch doesn't write to the database if it gets interrupted

I was trying to use the llm.fetch recipe to save annotations to a database, but I have found that nothing gets saved if the process gets terminated before it has worked through all of the examples in the source data. This has happened on OpenAI's end when the server doesn't respond for some reason. Wondering if this is a bug? I'm using the latest version, 1.14.12. Thanks

dotenv run -- prodigy textcat.llm.fetch fewshot_openai.cfg testdb.jsonl dataset:testdb

Hi there.

I just checked our implementation and it indeed seems like we wait until the entire set is collected. I can also see how this is not ideal in the context of LLMs, so I'll write up a ticket for this.

One thing that might help in the meantime though: did you turn on caching for spaCy-LLM? The spaCy-LLM docs on caching give more info.

By turning this on, you'll maintain a cache of each example that you receive even if there's a hiccup. That way, you can recover from where the previous stream stopped without losing examples. Would that help in the meantime?
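For reference, the cache is configured on the llm component in your config file. A minimal example along these lines (the path and batch sizes are just placeholders to adjust):

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10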

Thanks, that’s great. Good tip about the cache. I’ll use that in the meantime and maybe batch the source data to send through. Happen to have any tips or helpers in Prodigy or spaCy to help with the batching?

Just to follow up: I get errors when trying to pipe into the source argument for textcat.llm.fetch. My idea was to loop through batches of lines in the JSONL file and progressively add them to the dataset with --resume.

Getting labels from the 'llm' component
Using 5 labels: ['ASSESS', 'BREATHING', 'MOVEMENT', 'OTHER', 'PAIN']
ℹ RECIPE: Resuming from previous output file:
dataset:openai_test_batch
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/aaronconway/audiosedstate/venv/lib/python3.10/site-packages/prodigy/main.py", line 50, in
main()
File "/Users/aaronconway/audiosedstate/venv/lib/python3.10/site-packages/prodigy/main.py", line 44, in main
controller = run_recipe(run_args)
File "cython_src/prodigy/cli.pyx", line 123, in prodigy.cli.run_recipe
File "cython_src/prodigy/cli.pyx", line 124, in prodigy.cli.run_recipe
File "/Users/aaronconway/audiosedstate/venv/lib/python3.10/site-packages/prodigy/recipes/llm/textcat.py", line 155, in llm_fetch_textcat
total = sum(1 for _ in stream.copy())
File "cython_src/prodigy/components/stream.pyx", line 374, in prodigy.components.stream.Stream.copy
File "cython_src/prodigy/components/source.pyx", line 371, in prodigy.components.source.GeneratorSource.copy
TypeError

cat test.jsonl | dotenv run -- prodigy textcat.llm.fetch fewshot_openai.cfg - dataset:openai_test_batch --loader jsonl --resume

test.jsonl

{"text": "It's okay.", "meta": {"pid": "P018", "segment": "18", "start_time": "105.537", "end_time": "106.778"}}
{"text": "One chance to get a fresh breath there.", "meta": {"pid": "P018", "segment": "19", "start_time": "106.778", "end_time": "109.279"}}
{"text": "Oh, it feels good.", "meta": {"pid": "P018", "segment": "20", "start_time": "109.279", "end_time": "111.32"}}
{"text": "So this smells like the plastic that it's made of.", "meta": {"pid": "P018", "segment": "21", "start_time": "111.32", "end_time": "113.301"}}
{"text": "All watered up now.", "meta": {"pid": "P018", "segment": "24", "start_time": "116.583", "end_time": "121.606"}}

fewshot_openai.cfg

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v2"
config = {"temperature": 0.0}

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = MOVEMENT,BREATHING,ASSESS,PAIN,OTHER
exclusive_classes = true

[components.llm.task.label_definitions]
MOVEMENT = "A specific instruction for a person to stay still or stop moving. It may be to stop moving a particular body part like hands or arms or legs, or just to stop moving and stay still in general."
BREATHING = "A specific instruction to either stop breathing or start breathing."
ASSESS = "A question asking a person how they are feeling."
PAIN = "A person describing their pain or discomfort."
OTHER = "Everything else that does not fall into the other categories."

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.jsonl"

[components.llm.task.normalizer]
@misc = "spacy.LowercaseNormalizer.v1"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

Hi @awconway,

To take a step back, I think the best way to incrementally add examples to the DB is to use the batch_size parameter of the add_examples database method: with a smaller batch size, examples are committed to the database more frequently, so less work is lost if the process gets interrupted. The problem is that we don't expose this argument at the recipe level (and we probably should - adding that to my TODOs), so you'd have to modify the recipe itself.
You can find the recipe source code in the Prodigy package at /path/to/prodigy/recipes/llm/textcat.py (to double-check where Prodigy was installed, you can run prodigy stats and check the path under Location).
In there, on line 185 you could try adding batch_size=5 (it defaults to 64), so:

db.add_examples(stream, batch_size=5, datasets=[dataset])
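If it helps to see what that parameter does in isolation, here's a small standalone sketch using the Database API directly (the example data and dataset name here are just placeholders): add_examples writes in chunks of batch_size, so a smaller value means fewer fetched examples are lost if the process dies mid-stream.

from prodigy import set_hashes
from prodigy.components.db import connect

# Placeholder example standing in for the stream of fetched annotations
examples = [set_hashes({"text": "It's okay."})]

db = connect()
if "openai_test_batch" not in db:
    db.add_dataset("openai_test_batch")
# Examples are committed in chunks of `batch_size`, so a smaller value
# means they land in the database more often
db.add_examples(examples, batch_size=5, datasets=["openai_test_batch"])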

Now, to explain the ugly error you're seeing in your latest run.
Whenever stdin is used as the source in Prodigy, it's interpreted as a source of the "generator" type. Since the reimplementation of Stream and Source in Prodigy 1.12, we have moved away from the generic "generator" source and implemented specific source types depending on, for example, the type of the input file, including a new GeneratorSource.
The GeneratorSource was implemented for backward compatibility and as a fallback in case the input cannot be parsed as any known source subtype, and that includes stdin as a source.
The LLM recipes are the latest recipes, developed after the transition to the new Stream and Source, and it turns out that textcat.llm.fetch uses a copy method that was not implemented as part of the interface of the new GeneratorSource class, which is an error on our end.
Therefore, as a workaround, I'd suggest passing the input as a JSONL file rather than via stdin, if possible.
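If you still want to split the work into smaller runs yourself, something along these lines should do it. This is just a rough sketch assuming the file, config and dataset names from your messages above; adjust the batch size and paths as needed.

import subprocess
import tempfile
from pathlib import Path

SOURCE = Path("test.jsonl")
BATCH_SIZE = 50

lines = SOURCE.read_text(encoding="utf8").splitlines()
for start in range(0, len(lines), BATCH_SIZE):
    # Write each batch to a temporary JSONL file instead of piping via stdin
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
        tmp.write("\n".join(lines[start : start + BATCH_SIZE]))
        batch_path = tmp.name
    # Same call as before, but with a real file as the source and --resume
    # so examples already in the dataset aren't fetched again
    subprocess.run(
        [
            "dotenv", "run", "--",
            "prodigy", "textcat.llm.fetch",
            "fewshot_openai.cfg", batch_path, "dataset:openai_test_batch",
            "--resume",
        ],
        check=True,
    )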