@laneguage,
Based on this thread, it sounds like the tasks at hand are:
- use `spacy-llm` to pre-process input, i.e. expand acronyms;
- apply annotation suggestions to the transformed text;
- use Prodigy to correct these labels.
As @magdaaniol already suggested, it's not possible to annotate a span that doesn't exist, so your best bet is coming up with a clever way of pre-processing the text for span annotation downstream.
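To make that concrete: once the text is rewritten, any character offsets computed on the original no longer line up with the new string, which is why annotation has to happen on the transformed text. A minimal illustration (the example strings here are invented):

```python
# Hypothetical example: span offsets computed on the abbreviated text
# do not survive acronym expansion.
abbreviated = "Take me to McDs on 5th st."
expanded = "Take me to McDonalds on 5th street."

# A span covering "McDs" in the original text...
start = abbreviated.find("McDs")
end = start + len("McDs")
assert abbreviated[start:end] == "McDs"

# ...points at different characters in the expanded text,
# so spans must be created on the transformed text instead.
print(expanded[start:end])  # -> "McDo", not a meaningful span
```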
I'd suggest:
- writing a script to transform the given examples using a custom `spacy-llm` task;
- running a `spans.llm.correct` recipe on the transformed examples.
Let's break this down further. See the below script:
"""
dotenv run -- python -m expand_text
"""
from typing import List, Dict, Iterable
import spacy
from spacy.tokens import Doc
from spacy_llm.registry import registry
from spacy_llm.util import assemble
from pathlib import Path
import srsly
from tqdm import tqdm
Doc.set_extension("extended_text", default=None, force=True)
@registry.llm_tasks("expand_text.ExpandTextTask.v1")
def expand_text_task(abbreviated_examples: List[Dict[str, str]]) -> "ExpandTextTask":
return ExpandTextTask(abbreviated_examples=abbreviated_examples)
class ExpandTextTask:
def __init__(self, abbreviated_examples: List[Dict[str, str]]):
self.abbreviated_examples = abbreviated_examples
def generate_prompts(self, docs: Iterable[Doc]) -> Iterable[str]:
prompts = []
for doc in docs:
# Create an examples section for the prompt to help the model understand the task better
examples_section = "Here are some examples of acronyms or abbreviations and their full forms:\n" + \
"\n".join([f"{example['abbr']} means {example['full']}" for example in self.abbreviated_examples])
# Generate a prompt for each document
prompt = f"You are an expert at transforming acronyms or shortened terms into full forms.\n\n{examples_section}\n\nPlease identify and replace all acronyms with their full form in the following text: \"{doc.text}\""
prompts.append(prompt)
return prompts
def parse_responses(self, docs: Iterable[Doc], responses: Iterable[str]) -> Iterable[Doc]:
for doc, response in zip(docs, responses):
doc._.extended_text = response
yield doc
if __name__ == "__main__":
# Example usage
data_path = Path.cwd() / 'data'
source = data_path / "examples.jsonl"
transformed_source = data_path / "transformed_examples.jsonl"
config_path = Path.cwd() / "configs/preprocess_text_config.cfg"
#load data
examples = srsly.read_jsonl(source)
nlp = assemble(config_path)
print("transforming data...")
transformed_examples = []
for eg in tqdm(examples):
doc = nlp(eg["text"])
extended_text = doc._.extended_text
if extended_text:
extended_text_clean = extended_text[0].strip().replace("\n", " ")
transformed = True
else:
extended_text_clean = eg["text"]
transformed = False
transformed_example_dict = {"text": extended_text_clean,
"meta": {"id": eg["meta"]["id"], "transformed": transformed}}
transformed_examples.append(transformed_example_dict)
print("saving transformed data...")
srsly.write_jsonl(transformed_source, transformed_examples)
In this script, I:
- define a custom `spacy-llm` task that identifies and replaces abbreviations etc. with their full forms;
- load in my current unlabelled training data and transform the text using the custom `spacy-llm` task;
- keep track of important metadata, like the `id` associated with the text and whether it was successfully transformed or not;
- save out a new file locally that contains the transformed data.
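For reference, a record in `examples.jsonl` and its transformed counterpart would look roughly like this (the text and id below are invented for illustration):

```python
# Hypothetical record from data/examples.jsonl (text and id invented)
input_record = {"text": "Take me to McDs on 5th st.", "meta": {"id": 42}}

# Corresponding record the script writes to data/transformed_examples.jsonl
# when the LLM call succeeds: expanded text plus bookkeeping metadata.
output_record = {
    "text": "Take me to McDonalds on 5th street.",
    "meta": {"id": 42, "transformed": True},
}

# The id is carried over so each annotation can be traced back to its source;
# the "transformed" flag marks rows that fell back to the original text.
assert output_record["meta"]["id"] == input_record["meta"]["id"]
```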
The `preprocess_text_config.cfg` config file looks like this:
```ini
[nlp]
lang = "en"
pipeline = ["preprocess_text"]

[components]

[components.preprocess_text]
factory = "llm"
save_io = true

[components.preprocess_text.task]
@llm_tasks = "expand_text.ExpandTextTask.v1"
abbreviated_examples = [{"abbr": "McDs", "full": "McDonalds"}, {"abbr": "USA", "full": "United States of America"}, {"abbr": "st.", "full": "street"}]

[components.preprocess_text.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.0}
```
Once I have my transformed data, I can define my spancat labelling config (`spancat_config.cfg`) as follows:
```ini
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v2"
labels = ["intent", "current_location", "target_location"]

[components.llm.task.label_definitions]
intent = "Extract the user intent. Can be poi_info if the user is looking for information about a place, or start_navigation if they want to go to the place now."
current_location = "Extract the user's current location, as described from the context."
target_location = "Extract where the user wants to go, or wants more information about."

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10
```
And run the `spans.llm.correct` recipe on my transformed text:
```shell
dotenv run -- python -m prodigy spans.llm.correct \
  transformed-sents configs/spancat_config.cfg \
  data/transformed_examples.jsonl
```
This should give you a span-correction interface in Prodigy, which allows annotators to correct the suggested labels downstream.
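Once annotators have worked through the stream, each saved Prodigy task is a JSON record with the `text`, character-offset `spans`, and an `answer` field. A quick sketch of pulling the accepted span texts back out (the record below is fabricated to show the shape):

```python
# Fabricated task in the shape produced by `prodigy db-out`
task = {
    "text": "Take me to McDonalds on 5th street.",
    "spans": [
        {"start": 0, "end": 10, "label": "intent"},
        {"start": 11, "end": 20, "label": "target_location"},
    ],
    "answer": "accept",
}

if task["answer"] == "accept":
    # Slice each span back out of the text via its character offsets
    labelled = {s["label"]: task["text"][s["start"]:s["end"]] for s in task["spans"]}
    print(labelled)  # {'intent': 'Take me to', 'target_location': 'McDonalds'}
```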
Hopefully this helps! Do let us know how you end up pre-processing the text.