Is this a known issue? I have a few long documents in my dataset and am seeing errors about too many tokens emanating from the call to OpenAI.
I have tried modifying the config to add a sentencizer component, only to find that Prodigy reports an error when the recipe tries to add its own sentencizer.
Command line:
dotenv run -- prodigy ner.llm.correct --segment XXXX llm.cfg XXX.jsonl
It is a known headache: OpenAI enforces strict token limits, and a long prompt (which can be the result of adding a bunch of few-shot examples) adds up quickly.
One option that might help is to switch to a model backend that allows for a longer context window.
Thank you. I definitely get the token limitation, which is why I was running the recipe with "--segment".
My core problem is that the recipe in this version of Prodigy doesn't actually split my documents into sentences, either when the "--segment" argument is provided or when I configure the pipeline with a "sentencizer" ahead of the "llm" component. From looking at the recipe code, I can't see why that isn't happening.
Outside of Prodigy, I have no issues building an LLM pipeline from that config with the sentencizer in place. This code works fine:
import spacy

nlp = spacy.load("ner-llm")
doc = nlp("... two sentences ...")
for s in doc.sents:
    print(s.text)
    for e in s.ents:
        print(e.text, e.label_)
    print()
Another reason this is surprising is that, internally, the recipe does this:
if segment:
    nlp.add_pipe("sentencizer")
    stream.apply(split_sentences, nlp=nlp, stream=stream)
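In case it helps, this is roughly the standalone check I have in mind (just a sketch; I'm assuming split_sentences here is prodigy.components.preprocess.split_sentences and that it accepts a plain list of task dicts):

import spacy
from prodigy.components.preprocess import split_sentences

# Minimal pipeline whose only job is sentence segmentation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# A toy two-sentence "document"; I'd expect one task per sentence back.
stream = [{"text": "First sentence here. Second sentence here."}]
for eg in split_sentences(nlp, stream):
    print(eg["text"])

If that splits as expected, the segmentation helper itself seems fine, and the problem would have to be in how the recipe wires it up.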
Is it possible to share your full config with a single example? That way I might be better able to reproduce what you are seeing.
One more thing: we just released version 1.14.0, which comes with a new CLI parser. Just to rule out any CLI mishaps, could you upgrade Prodigy and see if the issue persists?
I did update to 1.14 and everything seems to be working fine now. I'm not really sure what was going on; it's possible that I got fouled up by not clearing the cache between a run without --segment and one with it. Thanks for taking a look.