ner.llm.correct with --segment presents full documents in the UI

Is this a known issue? I have a few long documents in my dataset and am seeing "too many tokens" errors coming from the call to OpenAI.

I have tried modifying the config to add a sentencizer component myself, only to find that Prodigy reports an error when the recipe tries to add its own sentencizer.

Command line:

dotenv run -- prodigy ner.llm.correct --segment XXXX llm.cfg XXX.jsonl

Config:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = [...OMITTED...]

[components.llm.task.label_definitions]
....OMITTED....

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "examples.yml"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

Installed packages:

python             3.11
prodigy            1.13.2
spacy              3.6.1
spacy-llm          0.4.3

Hi Chris.

It is a known headache that OpenAI enforces strict token limits. Especially if you have a long prompt (which can happen when you add a bunch of few-shot examples), the tokens can add up quickly.

One avenue that might help is to switch to a model that allows for a longer context window.

Models with larger context windows are supported directly in spacy-llm; the docs list the available model registries.
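For instance, a larger-context variant of the same model can usually be selected via the model's name argument. A minimal sketch, assuming your spacy-llm version accepts a name override on this registry (the gpt-3.5-turbo-16k variant here is just an example; check the spacy-llm docs for the supported names):

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
name = "gpt-3.5-turbo-16k"
config = {"temperature": 0.3}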

Does this help? If it does not, could you share the error message?

Thank you. I definitely get the token limitation, which is why I was running the recipe with "--segment".

My core problem is that the recipe in this version of Prodigy doesn't actually split my documents into sentences, either when the "--segment" argument is provided or when I configure the pipeline with a "sentencizer" ahead of the "llm" component. Looking at the recipe code, I can't see why that isn't happening.
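For reference, this is the kind of change I tried on top of my config above (a sketch of the relevant sections only):

[nlp]
lang = "en"
pipeline = ["sentencizer", "llm"]

[components.sentencizer]
factory = "sentencizer"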

Outside of Prodigy, I have no issues building an LLM pipeline from that config with the sentencizer in place. This code works fine:

import spacy

# Pipeline built from the config above, with the sentencizer in place
nlp = spacy.load("ner-llm")
doc = nlp("... two sentences ...")

for s in doc.sents:
    print(s.text)
    for e in s.ents:
        print(e.text, e.label_)
    print()

That is strange. I just tried it locally and it does seem to work. I'm using this examples.jsonl file:

{"text": "Spaghetti Bolognaise is a great dish. But this is another sentence. So there you have it." }

With this config:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PERSON", "ORGANISATION", "LOCATION"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "tests/cache/ner"
batch_size = 3
max_batches_in_mem = 4

This call, without --segment, presents the full text as a single task in the UI:

dotenv run -- python -m prodigy ner.llm.correct issue-6800 config.cfg examples.jsonl

This call, with --segment, presents one task per sentence:

dotenv run -- python -m prodigy ner.llm.correct issue-6800 config.cfg examples.jsonl --segment

Another reason it is surprising is that, internally, the recipe does this:

if segment:
    nlp.add_pipe("sentencizer")
    stream.apply(split_sentences, nlp=nlp, stream=stream)
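For reference, a minimal sketch of what that splitting amounts to (an illustration only, not Prodigy's actual split_sentences helper):

import spacy

def split_into_sentence_tasks(stream, nlp):
    # Hypothetical stand-in for split_sentences: run each example
    # through the pipeline and emit one task per sentence.
    for eg in stream:
        doc = nlp(eg["text"])
        for sent in doc.sents:
            yield {"text": sent.text}

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
examples = [{"text": "Spaghetti Bolognaise is a great dish. But this is another sentence."}]
for task in split_into_sentence_tasks(examples, nlp):
    print(task)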

Could you share your full config along with a single example? That way I may be better able to reproduce what you are seeing.

One more thing:

We just released version 1.14.0, which comes with a new CLI parser. Just to rule out any CLI mishaps, could you upgrade Prodigy and see if the issue persists?

I did update to 1.14 and everything seems to be working fine. I'm not really sure what was going on; it's possible I fouled myself up by not clearing the cache between a run without --segment and one with it. Thanks for taking a look.
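For anyone else who hits this: since my config uses spacy.BatchCache.v1 with path = "local-cached", clearing the cache between runs just means deleting that directory. A minimal sketch:

import shutil

# Remove the on-disk BatchCache so responses cached by an earlier,
# unsegmented run aren't replayed
shutil.rmtree("local-cached", ignore_errors=True)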


No worries. Happy to hear it's working again 🙂!