Span and TextCat but with an LLM

This sample is very close to what I want, but I am trying to use an LLM to tag the spans and select the categories: prodigy-recipes/tutorials/span-and-textcat/recipe.py at master · explosion/prodigy-recipes (github.com). I have tried loading it via cfg, in code, and even creating additional pipelines, but without success. Is it possible to leverage an LLM if you merge the two?

Welcome to the forum @Netizine :wave:

If what you'd like to do is correct the span and textcat suggestions from an LLM simultaneously, the easiest way to go about it would be to pre-annotate your dataset with spans and textcat using an LLM and use that as input to a recipe similar to the one you linked. The spans-and-textcat recipe just uses the blocks UI with spans and textcat as components, so as long as your input stream contains the right annotations under the right keys, it doesn't really matter whether the annotation was done within or outside the recipe. In fact, pre-annotating outside the recipe has an important advantage: it lets you pre-annotate in large batches, which is much faster and rules out annotators having to wait on the API.
It's totally fine, for this purpose, to do the spans and textcat pre-annotation in two separate steps. This way we can reuse the built-in recipes and won't have to worry about defining multiple llm components in the pipeline.

Suggested steps in detail (using the news_headlines dataset as an example):

  1. Pre-annotate the dataset with spans using spans.llm.fetch
dotenv run -- python -m prodigy spans.llm.fetch spancat.cfg  ../../news_headlines.jsonl ./spans-annotated-news.jsonl

This adds LLM suggestions for spans under the spans key, which is where the spans_manual UI expects them.
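The spancat.cfg referenced above is a regular spacy-llm config. As a minimal sketch it could look something like this (the spacy.SpanCat.v3 task, the GPT-3.5 model and the labels are just one possible choice, and the dotenv run -- prefix assumes your OPENAI_API_KEY lives in a .env file):

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v3"
labels = ["PER", "ORG", "LOC"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"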

  2. Pre-annotate the output of step 1 with textcat using textcat.llm.fetch
dotenv run -- python -m prodigy textcat.llm.fetch textcat.cfg  spans-annotated-news.jsonl spans-textcat-annotated-news.jsonl

This will add the textcat annotations under the accept key.
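textcat.cfg follows the same pattern, just with a text classification task. A minimal sketch, assuming the spacy.TextCat.v3 task with non-exclusive, made-up labels (swap in your own):

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v3"
labels = ["BUSINESS", "POLITICS", "SPORT"]
exclusive_classes = false

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"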

  3. Load the span- and textcat-annotated data into a custom recipe that uses the blocks view_id with spans_manual and choice as components.
    The recipe could be as simple as this:
import prodigy
from prodigy.components.stream import get_stream


@prodigy.recipe(
    "span-and-textcat.correct",
    dataset=("Dataset to save annotations into", "positional", None, str),
    file_in=("Path to examples.jsonl file", "positional", None, str),
)
def custom_recipe(dataset, file_in):

    stream = get_stream(file_in)
    span_labels = ["PER", "ORG", "LOC"]

    blocks = [
        {"view_id": "spans_manual"},  # span annotation on the text
        {"view_id": "choice", "text": None},  # textcat options; "text": None avoids repeating the text
    ]
    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {  # Additional config settings, mostly for app UI
            "blocks": blocks,
            "labels": span_labels,
            "choice_style": "multiple",
        },
    }
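For reference, after the two fetch steps each line in spans-textcat-annotated-news.jsonl should look roughly like this (pretty-printed here; the values are made up, and the fetch recipes also add tokens and token-level span offsets that are omitted for brevity):

{
  "text": "Apple opens new office in Berlin",
  "spans": [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 26, "end": 32, "label": "LOC"}
  ],
  "options": [
    {"id": "BUSINESS", "text": "BUSINESS"},
    {"id": "POLITICS", "text": "POLITICS"}
  ],
  "accept": ["BUSINESS"]
}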

Now you can curate these annotations without an LLM in the loop by calling:

python -m prodigy span-and-textcat.correct test spans-textcat-annotated-news.jsonl -F recipe.py

This should result in the following interface, letting you correct all the labels:

@magdaaniol This worked fantastically well. In fact, it's more than I could have hoped for. The one thing I would love to be able to do in the same example is actually Link to the chat as explained here. This would be the killer demo if the span-and-textcat sample could Link To Chat :wink: Deploying a Prodigy cloud service for Posh’s financial chatbots · Explosion


Glad to hear it worked for you @Netizine!
Just adding a link is really simple. You would add an html block to the existing blocks and a function that copies the link from the input file into the key used by the html template.
Assuming the link is stored under the meta.chat_link key in the input JSONL:

import prodigy
from prodigy.components.stream import get_stream

LINK_TEMPLATE = """
<div class="cleaned" style="display: flex; justify-content: center; align-items: center; max-width: 80%; margin: 0 auto; padding: 10px; border: 1px solid #ddd; border-radius: 10px; background-color: #f9f9f9;">
    {{#chat_link}}
        <a href="{{chat_link}}" target="_blank" style="margin: 0; padding: 0; text-decoration: none; color: inherit;">Link to the full chat</a>
    {{/chat_link}}
    {{^chat_link}}
        <span style="margin: 0; padding: 0;">No link available</span>
    {{/chat_link}}
</div>
"""


def add_chat_link(stream):
    # Copy the link from "meta" to a top-level key the HTML template can read
    for eg in stream:
        chat_link = eg.get("meta", {}).get("chat_link")
        eg["chat_link"] = chat_link
        yield eg


@prodigy.recipe(
    "span-and-textcat.correct",
    dataset=("Dataset to save annotations into", "positional", None, str),
    file_in=("Path to examples.jsonl file", "positional", None, str),
)
def custom_recipe(dataset, file_in):

    stream = get_stream(file_in)
    span_labels = ["PER", "ORG", "LOC"]
    stream = add_chat_link(stream)

    blocks = [
        {"view_id": "spans_manual"},  # span annotation on the text
        {"view_id": "choice", "text": None},  # textcat options; "text": None avoids repeating the text
        {"view_id": "html", "html_template": LINK_TEMPLATE},  # renders the chat link
    ]
    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "config": {  # Additional config settings, mostly for app UI
            "blocks": blocks,
            "labels": span_labels,
            "choice_style": "multiple",
        },
    }

Here I'm using Mustache, the built-in HTML template rendering engine, but you can use Jinja2 or another engine instead. Using a template makes it a bit easier to add styling etc. If the link is stored in the input file under the same key that is used in the template, you don't even need the extra function: Prodigy should render the template correctly directly from the data available in the task.
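For example, an input line like this (hypothetical URL) would be picked up by the {{chat_link}} template directly, without the add_chat_link helper:

{"text": "I'd like to check my account balance.", "chat_link": "https://example.com/chats/123"}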
Note that if you don't want to make the link so conspicuous, you can also just keep it under meta and it will be rendered in the lower-right corner: