spans.llm.correct seems to have good llm response but no highlighting

First post, first prodigy experiment, please be lenient :innocent:.

I have set up a spans.llm.correct task (using the spacy.SpanCat.v3 task and the spacy.GPT-4.v3/gpt-4o model) for marking references to literature in text with an LLM in the loop, and I wonder why I don't get any highlighting to manually confirm or reject. The text to be processed is just displayed without any highlighting at all. When I expand the LLM prompt and response, they do look meaningful, and there should be enough information for Prodigy to highlight something. See screenshot.

This is Prodigy 1.15.4, spaCy 3.7.5, and spacy-llm 0.7.2.

Maybe it is a problem with the Markdown emphasis (asterisks) that the LLM has introduced in its response?

Thanks for any insights!

As a side note: do you think it would be better to handle this as an NER task? The only thing that will ever be overlapping is the whole reference encompassing its components (author, title, pages, etc.). And NER seems to open up more options when using other LLMs, training/fine-tuning, etc.

Welcome to the forum @awagner-mainz :wave:

You're right - the default parsing function of the SpanCat task does not handle the responses you receive correctly. If you look at the main parsing logic, you'll notice that upon splitting on | it expects 4 items, while in your case:

the first | is being substituted with a comma, which leads to a ValueError, which in turn leads to the response being skipped and no spans being created that Prodigy could use for highlighting.
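To make the failure mode concrete, here is an illustrative sketch (not spacy-llm's actual parsing code; the exact field layout is assumed) of why a single substituted separator silently drops the whole line:

```python
# A well-formed response line splits on "|" into exactly 4 fields
# (field names here are hypothetical, for illustration only).
line_ok = "1 | Smith 1999 | REFERENCE | True"
fields = line_ok.split("|")
print(len(fields))  # 4 -> unpacks fine, a span is created

# If the LLM replaces the first "|" with a comma (or adds *emphasis*),
# unpacking into 4 variables raises ValueError and the line is skipped:
line_bad = "1, *Smith 1999* | REFERENCE | True"
try:
    idx, text, label, flag = line_bad.split("|")
except ValueError:
    print("ValueError -> line skipped, no span to highlight")
```

This matches what you see in the UI: the response looks reasonable when expanded, but none of its lines survive parsing, so there is nothing to highlight.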
Now, how to fix it:

Option 1

Try to modify the prompt by explicitly stating that it should not add any extra characters (such as *), or that the entity should be an exact span from the paragraph, etc.
You can easily modify the prompt by submitting a custom prompt template via config:

@llm_tasks = "spacy.SpanCat.v3"
labels = ["COMPLIMENT", "INSULT"]
template = "path/to/the/template.jinja2"

You can use the original SpanCat template as a starting point.

Option 2

Since the offending character seems to be consistently *, you could also write your own task.
spaCy docs on how to define custom tasks can be found here.
This blog shows step by step how to define custom task class using a simpler use case: Elegant prompt versioning and LLM model configuration with spacy-llm | by Déborah Mesquita | Towards Data Science
This is just to give you an overview. Your case will be a bit more complex because you'll need to make sure the answers are valid spans with respect to the text. You can find the current logic here: spacy-llm/spacy_llm/tasks/span/ at 117f68963870fd2a4af4c706c40cf223c6ae6fde · explosion/spacy-llm · GitHub.
You'll need to recreate most of it, adding the extra logic for cleaning up * - do let us know if you run into problems implementing that! (Sharding is optional - you don't have to worry about it for the first version.)
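For the cleanup itself, a hedged sketch of what that extra logic in a custom task might look like (function names are made up; this is not spacy-llm API):

```python
import re

def clean_llm_span(raw: str) -> str:
    """Strip Markdown emphasis characters (* and _) the LLM may have added."""
    return re.sub(r"[*_]", "", raw).strip()

def locate_span(text: str, raw_answer: str):
    """Find character offsets of the cleaned answer in the original text.

    Returns (start, end) if the cleaned answer is an exact substring of
    the paragraph, otherwise None (the response can't be aligned).
    """
    cleaned = clean_llm_span(raw_answer)
    start = text.find(cleaned)
    if start == -1:
        return None
    return start, start + len(cleaned)

# e.g. locate_span("See Smith 1999, p. 12.", "*Smith 1999*") -> (4, 14)
```

In a real custom task you would call something like this on each parsed answer before constructing the Span objects, so that an answer decorated with asterisks still maps back onto the original text.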

As to spancat vs NER question:
The very high-level difference between spancat and NER is that spancat generates span candidates to which a classifier assigns probabilities, while NER predicts where an entity starts and ends.
This is why, for spans with well-defined boundaries, NER tends to yield better results than spancat. Also, it is usually easier to model atomic entities, so if your overlapping case (the REFERENCE) can be composed of sub-entities, I would recommend annotating the simpler entities and inferring the compound ones in post-processing (e.g. via rules that specify that a given combination of entities constitutes a REFERENCE - you could use spaCy's Entity Ruler for this).
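To illustrate the post-processing idea, here is a minimal sketch of such a rule, operating on (start_token, end_token, label) tuples for simplicity (the sub-entity labels are hypothetical; in practice this could be a spaCy pipeline component or expressed as Entity Ruler patterns):

```python
# Hypothetical sub-entity labels that make up a bibliographic reference.
SUB_LABELS = {"AUTHOR", "TITLE", "PAGES"}

def infer_references(ents):
    """Merge runs of adjacent sub-entities into compound REFERENCE spans.

    ents: list of (start, end, label) token-index tuples, sorted by start,
    with end exclusive. Returns (start, end, "REFERENCE") tuples covering
    each run of back-to-back sub-entities.
    """
    subs = [e for e in ents if e[2] in SUB_LABELS]
    refs = []
    i = 0
    while i < len(subs):
        j = i
        # extend the run while the next sub-entity starts where this one ends
        while j + 1 < len(subs) and subs[j + 1][0] <= subs[j][1]:
            j += 1
        refs.append((subs[i][0], subs[j][1], "REFERENCE"))
        i = j + 1
    return refs

# e.g. AUTHOR + TITLE + PAGES in a row collapse into one REFERENCE:
# infer_references([(0, 2, "AUTHOR"), (2, 6, "TITLE"), (6, 8, "PAGES")])
# -> [(0, 8, "REFERENCE")]
```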

It might be that for some of your entities NER will be a better fit, and for others spancat. Yet another solution could be plain rule-based matching. It always takes some experimentation.
The good news is that if we eliminate the need for overlapping spans, NER and spancat annotations are essentially both just spans, and it's very easy to transform NER annotations into spancat annotations in order to experiment and compare the performance of spaCy NER and spancat on the same data. Assuming you have your NER annotations in spaCy DocBin format (e.g. via the data-to-spacy recipe), such a function that filters a subset of categories could look something like this:

from spacy.tokens import DocBin
from wasabi import msg

def rewrite_as_spans(nlp, dataset, split, dataset_dir, target_ner_file, cats):
    docbin = DocBin().from_disk(dataset)
    docs = list(docbin.get_docs(nlp.vocab))
    for doc in docs:
        old_ents = doc.ents
        new_ents = []
        new_spans = []
        for ent in doc.ents:
            # entities in the target categories become spancat spans,
            # the rest stay as NER entities
            if ent.label_ in cats:
                new_spans.append(ent)
            else:
                new_ents.append(ent)
        assert len(list(old_ents)) == len(new_spans) + len(new_ents)
        doc.ents = new_ents
        doc.spans["sc"] = new_spans
    output_path = f"{dataset_dir}/{str(target_ner_file)}_{split}.spacy"
    DocBin(docs=docs, store_user_data=True).to_disk(output_path)
    msg.good(f"Saved rewritten {split} dataset to {output_path}")

Great! Thanks a lot, @magdaaniol! Very helpful in my concrete issue, and much more food for thought and study than I had hoped for :+1:

With NER vs spancat, I was thinking mostly in terms of further using the data I will be creating. I am assuming it will be easier to make NER training data actually profitable, because there are many models with "simple NER" heads already available on HF, whereas with the spancat task and training data, training/fine-tuning would be more involved. I admit that this is all kind of speculation, since I have never trained or fine-tuned a model yet, but does it sound like there could be factors such as these down the road?

I would be ready to close the topic, should I do so? Or will you?

Thanks again, and best wishes,

Ah, you mentioned this aspect and I forgot to comment on it :sweat_smile: I agree with your hunch. There will certainly be more community support in terms of NER-pretrained models, which is an important advantage for sure. The main reason to opt for spancat is if you really have to model overlapping spans (which doesn't seem to be the case for you).

Good luck with the project and let us know how it's going!

(I'll "close" it by adding done tag - thanks!)
