Template-filling task with LLMs

Hello,

I am new here: I just got a license and am excited about the potential. I have a question about a custom task type that I'm having trouble implementing: template filling, using LLMs to populate text fields in a first pass.

Basically, assume a chunk of text like so:

Hello I am currently located at 123 main st. and I'm trying to find the nearest McD's

And assume we want to fill a template that looks like this:

{
  "intent": "poi_search",              # classification
  "current_location": "123 main st.",  # span labelling
  "target_location": "McDonalds",      # span labelling
  "target_modifiers": ["closest"]      # assume values picked from a closed set: {closest, best_reviewed}
}

I have played with built-in recipes like spans.llm.correct, and they come close to what I want. The catch is that my task combines a couple of different task types, and the values in the template need not always be extractive (see how the user query "McD's" is transformed by the LLM into "McDonalds").

I'm wondering if there is a way to build this .correct-style task, but allow the LLM to fill out the template as text fields directly, rather than showing text span labels (as these are only suitable for a subset of the labels we want).

Welcome to the forum @laneguage :slight_smile:

Do I understand correctly that for the input span:
Hello I am currently located at 123 main st. and I'm trying to find the nearest McD's
You'd like the spacy-llm task to return McDonalds rather than McD's (apart from the other spans)?
That's definitely possible if you can come up with the right prompt! If you can share the prompt that does what you want, we can help you with defining the custom task.

The thing is that I'm not sure how that is supposed to help with the creation of a gold standard for training an ML model.
You can't really annotate a span that doesn't exist.
It looks like you want to use spacy-llm to preprocess the input, apply annotation suggestions to the transformed text and then use Prodigy to correct these labels?
Consequently, your production pipeline would have two components: LLM-based preprocessing and a spancat ML model applied to its output. Does this sound right?
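In code, that two-step setup would look roughly like the sketch below (purely illustrative: the config path, the model location, and the extended_text extension are all placeholders, not working code):

# Illustrative sketch of the two-component production pipeline:
# an LLM preprocessing step followed by a trained spancat model.
import spacy
from spacy_llm.util import assemble

llm_preprocess = assemble("preprocess_config.cfg")  # hypothetical spacy-llm config
spancat_nlp = spacy.load("./spancat-model")         # hypothetical trained spancat model

def analyze(text: str):
    # First pass: let the LLM normalize the raw utterance
    doc = llm_preprocess(text)
    normalized = doc._.extended_text or doc.text  # custom extension set by the LLM task
    # Second pass: run the trained spancat model on the normalized text
    return spancat_nlp(normalized).spans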

Yes, that's right. I want the LLM to make a first pass at filling the template, and then my annotators can correct the text fields. For example, the prompt for the described task might be:

You are an expert information extractor for a navigation system. Given a user utterance and some conversation history, you must classify the intent, and extract meaningful entities. The template you must fill looks like this:

intent: can be poi_info if the user is looking for information about a place, or start_navigation if they want to go to the place now

current_location: this is the user's current location, as extracted from the context.

target_location: this is where the user wants to go, or wants more information about.

Respond in JSON, with the extracted value for each label, or None if there is no value.

This is the conversation history and context:

User is at 123 Main St.

This is the user's query:

hey how late is the nearest McD's open?

Expected response:

{
  "intent": "poi_info",
  "current_location": "123 Main St.",
  "target_location": "nearest McDonalds"
}

The example is a bit contrived and oversimplified, but it reflects the spirit of what I'm trying to do. I want the LLM to make a first pass at extracting some template of values from some context and an utterance.

@laneguage,

Based on this thread, it sounds like the tasks at hand are:

  1. use spacy-llm to pre-process the input, e.g. expand acronyms and abbreviations;
  2. apply annotation suggestions to the transformed text;
  3. use Prodigy to correct these labels.

As @magdaaniol already suggested, it's not possible to annotate a span that doesn't exist, so your best bet is to come up with a clever way of pre-processing the text for span annotation downstream.

I'd suggest:

  1. writing a script to transform the given examples using a custom spacy-llm task;
  2. running the spans.llm.correct recipe on the transformed examples.

Let's break this down further. See the script below:

"""
dotenv run -- python -m expand_text
"""

from pathlib import Path
from typing import Dict, Iterable, List

import srsly
from spacy.tokens import Doc
from spacy_llm.registry import registry
from spacy_llm.util import assemble
from tqdm import tqdm

Doc.set_extension("extended_text", default=None, force=True)

@registry.llm_tasks("expand_text.ExpandTextTask.v1")
def expand_text_task(abbreviated_examples: List[Dict[str, str]]) -> "ExpandTextTask":
    return ExpandTextTask(abbreviated_examples=abbreviated_examples)

class ExpandTextTask:
    """Custom spacy-llm task that rewrites abbreviations/acronyms in the doc text to their full forms."""

    def __init__(self, abbreviated_examples: List[Dict[str, str]]):
        self.abbreviated_examples = abbreviated_examples
        
    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        prompts = []
        for doc in docs:
            # Create an examples section for the prompt to help the model understand the task better
            examples_section = "Here are some examples of acronyms or abbreviations and their full forms:\n" + \
                               "\n".join([f"{example['abbr']} means {example['full']}" for example in self.abbreviated_examples])
            # Generate a prompt for each document
            prompt = f"You are an expert at transforming acronyms or shortened terms into full forms.\n\n{examples_section}\n\nPlease identify and replace all acronyms with their full form in the following text: \"{doc.text}\""
            prompts.append(prompt)
        return prompts
    
    def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
        # Each response is the raw string the model returned for one doc;
        # store it on the custom extension for downstream use
        for doc, response in zip(docs, responses):
            doc._.extended_text = response
            yield doc

if __name__ == "__main__":
    # Example usage
    data_path = Path.cwd() / "data"
    
    source = data_path / "examples.jsonl"
    transformed_source = data_path / "transformed_examples.jsonl"
    config_path = Path.cwd() / "configs/preprocess_text_config.cfg"

    # load data
    examples = srsly.read_jsonl(source)
    nlp = assemble(config_path)
    
    print("transforming data...")
    transformed_examples = []
    for eg in tqdm(examples):
        doc = nlp(eg["text"])
        extended_text = doc._.extended_text
        if extended_text:
            # extended_text is a plain string, so clean it up directly
            extended_text_clean = extended_text.strip().replace("\n", " ")
            transformed = True
        else:
            # fall back to the original text if the LLM returned nothing
            extended_text_clean = eg["text"]
            transformed = False
        transformed_example_dict = {"text": extended_text_clean,
                                    "meta": {"id": eg["meta"]["id"], "transformed": transformed}}
        transformed_examples.append(transformed_example_dict)
    
    print("saving transformed data...")
    srsly.write_jsonl(transformed_source, transformed_examples)

In this script, I:

  1. define a custom spacy-llm task that identifies abbreviations, acronyms, etc. and replaces them with their full forms;
  2. load my current unlabelled training data (see the sample input below) and transform the text using the custom spacy-llm task;
  3. keep track of important metadata, like the id associated with the text and whether it was successfully transformed;
  4. save out a new local file that contains the transformed data.
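For reference, the script assumes the input examples.jsonl looks something like this (the values are illustrative; the script only reads text and meta.id):

{"text": "hey how late is the nearest McD's open?", "meta": {"id": 1}}
{"text": "take me to the closest McDs", "meta": {"id": 2}}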

The preprocess_text_config.cfg config file looks like this:

[nlp]
lang = "en"
pipeline = ["preprocess_text"]

[components]

[components.preprocess_text]
factory = "llm"
save_io = True 

[components.preprocess_text.task]
@llm_tasks = "expand_text.ExpandTextTask.v1"
abbreviated_examples = [
    {"abbr": "McDs",
    "full": "McDonalds"},
    {"abbr": "USA",
    "full": "United States of America"},
    {"abbr": "st.",
    "full": "street"}]

[components.preprocess_text.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.0}

Once I have my transformed data, I can define my spancat labelling config (saved as configs/spancat_config.cfg) as follows:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = True

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v2"
labels = ["intent", "current_location", "target_location"]

[components.llm.task.label_definitions]
intent="Extract user intent. can be poi_info if the user is looking for information about a place, or start_navigation if they want to go to the place now"
current_location="Extract the user's current location, as described from the context."
target_location="Extract where the user wants to go, or wants more information about."

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10
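To verify the spancat task itself before handing anything to annotators, the same pattern works (illustrative; spacy-llm's SpanCat task writes its suggestions to doc.spans["sc"] by default):

# Quick check of the spancat LLM task on a transformed sentence (illustrative):
from spacy_llm.util import assemble

nlp = assemble("configs/spancat_config.cfg")
doc = nlp("hey how late is the nearest McDonalds open?")
print(doc.spans["sc"])  # default spans key for the SpanCat task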

And run the spans.llm.correct recipe on my transformed text:

dotenv run -- python -m prodigy spans.llm.correct \
transformed-sents configs/spancat_config.cfg \
data/transformed_examples.jsonl

This should give you a span-correction interface, allowing annotators to correct the suggested labels downstream.

Hopefully this helps! Do let us know how you end up pre-processing the text.

Hello!

This is great, and it definitely gets me close to what I'm trying to do. Yes, I recognize my task doesn't quite fit the mold of text/span classification, as it relies on the model's ability to resolve entities while extracting them.

I'll give your suggestions a go, thank you so much for taking the time!
