Dynamically defining a subset of labels to use in SpanCat

Hi! I'm trying to figure out how to fit Prodigy into my workflow.

Basically, I'm interested in detecting "concerns" from a predefined list of 50 or so in conversations inspired by Reddit posts, usually written by college students. I've been working on variations of this task for around six months now.

However, because the concern list is quite long, I run into a couple of issues when using Prodigy.

  1. For LLM-powered span detection, the possible labels take up the whole screen. I can move the label panel to the side, but that feels like a hacky workaround and doesn't seem encouraged. I know the default advice is to break this up into two or more annotation tasks, but that's difficult here: I do have broader categories (mental health, financial, etc.), but they're loosely defined and too broad to be useful for span categorization.

  2. When using GPT-3.5/4, I generally get good results. However, the model often doesn't have reasonable definitions of the concerns, which I've worked around by providing handwritten definitions (spacy.TextCat.v3). This ends up working well, but with 50 concerns it only performs well on gpt-4-1106-preview (and okay on gpt-4). However, I know a lot of context is being dropped from the definitions, and I don't want to overload the model. In an ideal world, it would only get the most relevant definitions so it can solidly reason about whether a concern is present in the conversation or not.

To remedy this, I'm trying to add an intermediate detection step: instead of using all 50 labels for my annotation task, I call a weaker model like gpt-3.5-turbo to get the (say) 12 most likely concerns based on the conversation. Then I can use those labels dynamically for the annotation task, which lets me give longer, more precise definitions and gives the model a lot more room to reason, since the context limit isn't being overflowed.
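
Roughly, the shortlisting step I have in mind looks like this (just a sketch using the openai Python client; the concern list is truncated here, and the prompt wording and the cutoff of 12 are placeholders I'd still tune):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Truncated here; in practice this is the full list of ~50 concerns.
CONCERNS = ["Anxiety", "Stress", "Depression", "Trauma", "Academic performance"]

def shortlist_concerns(conversation: str, k: int = 12) -> list[str]:
    """Ask a cheaper model which k concerns are most likely present."""
    prompt = (
        "Here is a conversation:\n\n"
        f"{conversation}\n\n"
        f"From the list below, return the {k} concerns most likely to be "
        "expressed by Speaker 1, one per line, exactly as written:\n"
        + "\n".join(CONCERNS)
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    candidates = [line.strip("- ").strip() for line in resp.choices[0].message.content.splitlines()]
    # Only keep answers that exactly match the predefined label list.
    return [label for label in candidates if label in CONCERNS][:k]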

How can I do this in Prodigy? Dynamically allocating labels does not seem supported, and I have tried working with the recipes, but I'm not sure where I would start. Detailed instructions would be so, so, SO appreciated. This project haunts me.

Finally, as a side note: is there a workflow/recipe for making predictions based on input text and validating them? After this task, for each Speaker 1 utterance, I'd like to identify (concern, parties affected, timeframe) based on the span-concern detection I just annotated. Would this just be a multi-label, multi-step classification task? Ideally, I'd like to be able to do all three at once.

My code, for reference:

llm.config:

[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v3"
name = "gpt-3.5-turbo-1106"
config = {"temperature": 0.0}


[components.llm.task]
@llm_tasks = "spacy.SpanCat.v3"
labels = [
    "Anxiety",
    "Stress",
    "Depression",
    "Trauma",
    "Attention/concentration difficulties",
    "Grief/loss",
    "Emotion dysregulation",
    "Suicidality",
    "Anger management",
    "Obsessions or compulsions",
    "Mood instability (bi-polar symptoms)",
    "Dissociative experiences",
    "Psychotic thoughts or behaviors",
    "Family",
    "Relationship problem",
    "Interpersonal functioning",
    "Social isolation",
    "Racial/ethnic/cultural concerns",
    "Discrimination",
    "Academic performance",
    "Adjustment to new environment",
    "Career",
    "Financial",
    "Eating/body image",
    "Sleep",
    "Health/medical",
    "Alcohol",
    "Drugs",
    "Sexual concern",
    "Pregnancy related",
    ...,
    "NO CONCERN DETECTED"]

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-ner-cache"
batch_size = 10
max_batches_in_mem = 10

Prompt template:

You are an expert Concern Detection system.
Your task is to accept a Conversation as input and extract concerns expressed, subtly or directly, by Speaker 1.
The entities you extract can overlap with each other.
{# whitespace #}
Entities must have one of the following labels: {{ ', '.join(labels) }}.
If a span is not an entity, label it `==NONE==`.
{# whitespace #}
{# whitespace #}
{%- if description -%}
{# whitespace #}
{{ description }}
{# whitespace #}
{%- endif -%}
{%- if label_definitions -%}
Below are definitions of each label to help aid you in what kinds of named entities to extract for each label.
Assume these definitions are written by an expert and follow them closely.
{# whitespace #}
{%- for label, definition in label_definitions.items() -%}
{{ label }}: {{ definition }}
{# whitespace #}
{%- endfor -%}
{# whitespace #}
{# whitespace #}
{%- endif -%}
{%- if prompt_examples %}
Q: Given the paragraph below, identify a list of entities, and for each entry explain why it is or is not an entity:
{# whitespace #}
{# whitespace #}
{%- for example in prompt_examples -%}
Paragraph: {{ example.text }}
Answer:
{# whitespace #}
{%- for span in example.spans -%}
{{ loop.index }}. {{ span.to_str() }}
{# whitespace #}
{%- endfor -%}
{# whitespace #}
{# whitespace #}
{%- endfor -%}
{%- else %}
{# whitespace #}
Here is an example of the output format for identifying student concerns within a conversation.
Only use this output format and the labels provided in the list of possible concerns.
Do not output anything besides entities in this output format.
Output entities in the order they occur in the conversation regardless of label.

Q: Given the conversation below, identify a list of entities related to student concerns, and for each entry explain why it is or is not an entity:

Conversation:

Speaker 1: I'm really struggling with feeling guilty about not being busy. I'm a first-year PhD student in psychology, and I just don't have much to do. I'm piloting two studies, but the data collection is slow, and my PIs don't want me to start anything new until these studies are well underway.

Speaker 2: That sounds tough. It's understandable to feel guilty, especially when you're used to being a hard worker.

Answer:
1. feeling guilty | True | Stress | indicates emotional distress related to not being busy
2. not being busy | True | Anxiety | may suggest concerns about productivity and self-worth
3. first-year PhD student | False | ==NONE== | identifies a role but is not a concern by itself
4. don't have much to do | True | Stress | reflects concerns over lacking work or activities
5. piloting two studies | False | ==NONE== | is a description of current academic activities, not a concern
6. data collection is slow | True | Academic performance | may lead to anxiety about meeting academic expectations
7. PIs don't want me to start anything new | True | Stress | could contribute to feelings of being stagnant or unproductive
8. used to being a hard worker | False | ==NONE== | describes a personal attribute, not a concern

{# whitespace #}
{# whitespace #}
{%- endif -%}
Conversation: {{ text }}
Answer:

Input JSONL line:

{"text":"Speaker 1: So I just submitted my UC application a few days ago and I was just looking it over again because I
 was helping a friend and I realized that I submitted an essay under the wrong prompt. I meant to put it under prompt #4,
 but instead I put it under prompt #5. Is there any way to fix this or am I screwed?\n\n
Speaker 2: Oh, man, that sucks. I can't believe you made that mistake. Did you try contacting the admissions office? 
They might be able to help you out. But damn, that's gotta be stressful.",
"meta":{"sid":"r3vlry"}}

As I said above, I've tried giving definitions, but the issue is the context window.
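
For reference, this is roughly how I'm running the config over lines like the one above at the moment (a sketch assuming spacy_llm's assemble helper; the file names are just what I use locally):

import json

from spacy_llm.util import assemble

# Build the pipeline from the config shown above (assumes OPENAI_API_KEY is set).
nlp = assemble("llm.config")

with open("conversations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        # spacy.SpanCat.v3 writes its predictions to doc.spans (default key "sc").
        for span in doc.spans.get("sc", []):
            print(eg["meta"]["sid"], span.label_, "->", span.text)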

Hope I provided enough detail. If there's anything else needed to help, please please let me know and I will provide it ASAP.

PS: If there is any documentation on using multiple spacy.llm calls in one config file, that would be super appreciated. I'd love an automatic reasoning check to make sure the model is doing alright before it sends me the results for annotation.

Edit: apologies for the massive number of questions in one post. I've been putting my all into learning Prodigy, as data quality/consistency issues are my biggest weakness at the moment. Dynamic labels are the most important thing.

Hi @darinkishore,

Thanks for the thoughtful post and welcome to the Prodigy community :wave:

This is an interesting use case. Let me discuss this with the team next week.

We just had a related post, and this has come up before too. But yes, the challenge is that Prodigy was designed around "one label set per session".

I like this direction, but now it looks like you're hitting a second problem of nested or hierarchical categorization. This is another UI challenge, but also one of designing the categorization scheme (e.g., why 12 of 50? How do you define the subsets?).

Have you seen the validate_answer callback? It's an easy way to check each answer as the annotator submits it.
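
In a custom recipe, you'd return a validate_answer function alongside the other recipe components; if it raises an error, the message is shown to the annotator so they can fix the answer. A minimal sketch (the specific check is just illustrative):

def validate_answer(eg):
    """Flag contradictory answers: NO CONCERN DETECTED alongside real concern spans."""
    labels = {span["label"] for span in eg.get("spans", [])}
    has_none = "NO CONCERN DETECTED" in labels
    has_concern = bool(labels - {"NO CONCERN DETECTED"})
    assert not (has_none and has_concern), (
        "NO CONCERN DETECTED can't be combined with other concern labels."
    )

# Returned as part of the recipe's components dict, e.g.:
# return {
#     "dataset": dataset,
#     "stream": stream,
#     "view_id": "spans_manual",
#     "validate_answer": validate_answer,
#     "config": {"labels": labels},
# }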

As for multiple spacy.llm calls in one config: I don't think that's possible, but perhaps open a discussion post on GitHub, or I can check with the team next week.

Update: I checked with a spaCy core teammate. It is possible if you use different components; the key is giving them unique names while still using the "llm" factory.

[nlp]
pipeline = ["llm_textcat", "llm_ner"]
...

[components]

[components.llm_textcat]
factory = "llm"

[components.llm_textcat.model]
@llm_models = ...

[components.llm_textcat.task]
@llm_tasks = "spacy.TextCat.v3"

...

[components.llm_ner]
factory = "llm"

[components.llm_ner.model]
@llm_models = ...

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v3"
...
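
Once assembled, both components run in the same pipeline and write to the same Doc, so you can read the textcat and NER results together. A rough sketch, assuming the config above is saved as e.g. multi_llm.cfg:

from spacy_llm.util import assemble

# Build the two-component pipeline from the config sketched above.
nlp = assemble("multi_llm.cfg")

doc = nlp("Speaker 1: I'm really struggling with feeling guilty about not being busy.")

print(doc.cats)                                 # categories from the "llm_textcat" component
print([(e.text, e.label_) for e in doc.ents])   # entities from the "llm_ner" component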

No worries at all! We really appreciate your post. As I mentioned, we've been thinking about several of these items for a while. We'll reach back out next week. Thank you!

Thank you so much for the reply!!

We're not really running into a nested/hierarchical problem here. It's just that the label set is too big: definitions for all 50 labels would overflow the context window, so we need some way to cut the set down.

In terms of the example you provided, could I maybe do a textcat and then a spancat? Ideally, I'd have full configurability over the "intermediate" labels. I was thinking of using the new GPT-4 logprobs.

Given this, do you happen to have any suggestions for implementation? Without dynamic labels, accuracy really suffers, so I can't label my data as fast as I feel Prodigy would otherwise let me. I've been struggling to create an eval set for the last couple of weeks.