Template-filling task with LLMs

@laneguage,

Based on this thread, it sounds like the tasks at hand are:

  1. use spacy-llm to pre-process the input, i.e. expand acronyms;
  2. apply annotation suggestions to the transformed text;
  3. use Prodigy to correct these labels.

As @magdaaniol already suggested, it's not possible to annotate a span that doesn't exist in the text, so your best bet is a pre-processing step that expands the text before span annotation downstream.

I'd suggest:

  1. writing a script to transform given examples using a custom spacy-llm task;
  2. running a spans.llm.correct recipe on the transformed examples.
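For concreteness, the script below assumes the unlabelled input (data/examples.jsonl) has a "text" field plus an id under "meta" — the values here are made up:

```json
{"text": "how do I get to the McDs on 5th st.?", "meta": {"id": 1}}
{"text": "what time does the USA embassy open?", "meta": {"id": 2}}
```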

Let's break this down further. See the below script:

"""
dotenv run -- python -m expand_text
"""

from pathlib import Path
from typing import Dict, Iterable, List

import srsly
from spacy.tokens import Doc
from spacy_llm.registry import registry
from spacy_llm.util import assemble
from tqdm import tqdm

Doc.set_extension("extended_text", default=None, force=True)

@registry.llm_tasks("expand_text.ExpandTextTask.v1")
def expand_text_task(abbreviated_examples: List[Dict[str, str]]) -> "ExpandTextTask":
    return ExpandTextTask(abbreviated_examples=abbreviated_examples)

class ExpandTextTask:
    def __init__(self, abbreviated_examples: List[Dict[str, str]]):
        self.abbreviated_examples = abbreviated_examples  
        
    def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
        prompts = []
        for doc in docs:
            # Create an examples section for the prompt to help the model understand the task better
            examples_section = "Here are some examples of acronyms or abbreviations and their full forms:\n" + \
                               "\n".join([f"{example['abbr']} means {example['full']}" for example in self.abbreviated_examples])
            # Generate a prompt for each document
            prompt = f"You are an expert at transforming acronyms or shortened terms into full forms.\n\n{examples_section}\n\nPlease identify and replace all acronyms with their full form in the following text: \"{doc.text}\""
            prompts.append(prompt)
        return prompts
    
    def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
        for doc, response in zip(docs, responses):
            doc._.extended_text = response
            yield doc

if __name__ == "__main__":
    # Example usage
    data_path = Path.cwd() / 'data'
    
    source = data_path / "examples.jsonl"
    transformed_source = data_path / "transformed_examples.jsonl"
    config_path = Path.cwd() / "configs/preprocess_text_config.cfg"

    #load data
    examples = srsly.read_jsonl(source)
    nlp = assemble(config_path)
    
    print("transforming data...")
    transformed_examples = []
    for eg in tqdm(examples):
        doc = nlp(eg["text"])
        extended_text = doc._.extended_text
        if extended_text:
            extended_text_clean = extended_text.strip().replace("\n", " ")
            transformed = True
        else:
            extended_text_clean = eg["text"]
            transformed = False
        transformed_example_dict = {"text": extended_text_clean,
                                    "meta": {"id": eg["meta"]["id"], "transformed": transformed}}
        transformed_examples.append(transformed_example_dict)
    
    print("saving transformed data...")
    srsly.write_jsonl(transformed_source, transformed_examples)

In this script, I:

  1. define a custom spacy-llm task that identifies and replaces abbreviations with their full forms;
  2. load my current unlabelled training data and transform the text using the custom spacy-llm task;
  3. keep track of important metadata, like the id associated with each text and whether it was successfully transformed;
  4. save out a new file locally that contains the transformed data.
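The fallback logic in that loop is worth isolating, since it decides what ends up in the output file: use the LLM output when one came back, otherwise keep the original text. A minimal sketch (clean_response is a made-up helper name, not part of spacy-llm):

```python
from typing import Dict, Optional, Union


def clean_response(original: str, extended: Optional[str]) -> Dict[str, Union[str, bool]]:
    """Mirror the fallback logic from the loop above: clean and use the
    LLM output when present, otherwise fall through to the original text."""
    if extended:
        return {"text": extended.strip().replace("\n", " "), "transformed": True}
    return {"text": original, "transformed": False}
```

This keeps examples the model failed to expand in the output (flagged via meta), so no training data is silently dropped.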

The preprocess_text_config.cfg config file looks like this:

[nlp]
lang = "en"
pipeline = ["preprocess_text"]

[components]

[components.preprocess_text]
factory = "llm"
save_io = true

[components.preprocess_text.task]
@llm_tasks = "expand_text.ExpandTextTask.v1"
abbreviated_examples = [
    {"abbr": "McDs",
    "full": "McDonalds"},
    {"abbr": "USA",
    "full": "United States of America"},
    {"abbr": "st.",
    "full": "street"}]

[components.preprocess_text.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.0}

Once I have my transformed data, I can define my spancat labelling config as follows:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v2"
labels = ["intent", "current_location", "target_location"]

[components.llm.task.label_definitions]
intent="Extract the user intent. Can be poi_info if the user is looking for information about a place, or start_navigation if they want to go to the place now"
current_location="Extract the user's current location, as described from the context."
target_location="Extract where the user wants to go, or wants more information about."

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10
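For reference, each record the transform script writes to data/transformed_examples.jsonl (and which spans.llm.correct reads) looks like this — text and id are made up:

```json
{"text": "how do I get to the McDonalds on 5th street?", "meta": {"id": 1, "transformed": true}}
```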

And run the spans.llm.correct recipe on my transformed text:

dotenv run -- python -m prodigy spans.llm.correct \
transformed-sents configs/spancat_config.cfg \
data/transformed_examples.jsonl

This should give you a Prodigy interface with the suggested spans pre-highlighted, allowing annotators to correct the labels downstream.

Hopefully this helps! Do let us know how you end up pre-processing the text.