hierarchical text classification using spancat and potentially expanding/hiding label subclasses as they come in context

I have nested labels that follow a structure: if a leaf label is true, its parent is deduced to be true. I would love it if Prodigy could handle this case. Currently the UI only supports a simple flat list of labels, which makes it unwieldy. Two workarounds have been suggested: one is to use a flat list as if all the labels were independent classes; the other is to split things into multiple recipes, which means each annotator has to re-read the text rather than tagging a whole hierarchy in one go. And if the labels are not hierarchical, the actual text on each label can't be kept succinct, wasting screen real estate. A conditional annotation hierarchy really helps mentally organize these classes and annotate with increasing specificity, similar to a depth-first walk of a graph, and is intuitive for the annotator.
I understand training the classifier is easier with flat, overlapping labels, but the annotation workflow is my primary focus right now; I can easily flatten the hierarchy later for classification if needed.
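For what it's worth, the flattening step can be a small piece of post-processing. Here is a sketch that expands leaf annotations to include every implied ancestor; the `HIERARCHY` mapping and label names are placeholders, not our real taxonomy:

```python
# Hypothetical parent lookup: child label -> parent label.
HIERARCHY = {
    "BILLING/REFUND": "BILLING",
    "BILLING/INVOICE": "BILLING",
    "SHIPPING/DELAY": "SHIPPING",
}

def expand_labels(leaf_labels):
    """Return the leaf labels plus every ancestor they imply."""
    expanded = set(leaf_labels)
    for label in leaf_labels:
        parent = HIERARCHY.get(label)
        while parent is not None:
            expanded.add(parent)
            parent = HIERARCHY.get(parent)
    return sorted(expanded)

print(expand_labels(["BILLING/REFUND", "SHIPPING/DELAY"]))
# → ['BILLING', 'BILLING/REFUND', 'SHIPPING', 'SHIPPING/DELAY']
```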
Also, I would have preferred the default annotation component to use the space on the screen; currently ~40% of it is white/empty, and the large number of labels smooshes the text.

Hi Dhurv,

Could you share a bit more about the use case? Are the nested labels ingredients in a recipe, or something else? If I understand the application better, I can give better advice.

Screen Real-Estate

With regards to screen real estate, you're able to change the CSS of the interface. You can adapt the global CSS to style the UI elements to your liking.

For example, here's a demo interface with base settings.

But if I adapt my local prodigy.json file to contain:

    "global_css": ".prodigy-container {max-width: 800px}"

Then the container element becomes wider.

Oh Hey Vincent! Nice to meet you here,
Thanks for the reply.
Yes, I spent some time going through the history of discussions and landed on similar settings. Another thing that helped increase visibility was color-coding specific root-level hierarchies by coloring the labels. I'll be able to share some snaps later, but I think the only way to really make this experience non-intimidating is if the label buttons could expand an area with sublevels, almost analogous to walking down a tree.

Here are the prodigy.json settings that I have landed on for the time being.

    "global_css": ".prodigy-container {max-width: 100%;} .prodigy-title label { font-size: 12px; font-family: sans-serif!important; font-weight: bold!important} [data-prodigy-label=\"<LABEL1>\"] { background: yellow;} [data-prodigy-label=\"<LABEL2>\"] {background: red;} [data-prodigy-label=\"<LABEL3>\"] {background: blue;} [data-prodigy-label=\"<LABEL4>\"] { background: green;} (continue for all labels)"
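Writing that string by hand for 60+ labels gets tedious, so one option is to generate it from a label-to-color map. A sketch, with placeholder label names and colors:

```python
import json

# Placeholder labels and colors; swap in the real taxonomy.
LABEL_COLORS = {
    "<LABEL1>": "yellow",
    "<LABEL2>": "red",
    "<LABEL3>": "blue",
    "<LABEL4>": "green",
}

base = (
    ".prodigy-container {max-width: 100%;} "
    ".prodigy-title label { font-size: 12px; "
    "font-family: sans-serif!important; font-weight: bold!important} "
)
# One CSS rule per label, matching Prodigy's data-prodigy-label attribute.
rules = " ".join(
    f'[data-prodigy-label="{label}"] {{background: {color};}}'
    for label, color in LABEL_COLORS.items()
)

config = {"global_css": base + rules}
print(json.dumps(config, indent=2))  # paste the output into prodigy.json
```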

Now regarding the use case. We are classifying short conversation summary notes from field agents into many overlapping topics that dive deeper into specificity, conditioned on the presence or absence of a topic hierarchy. Typically these topics interact with each other, giving rise to structural patterns, but it is hard to predict how they will co-occur until you have done the human annotations.

The volume of data is super low (< 100 notes/week), but the content is extremely rich: > 60 topics of interest, with ~5-10 root-level topics. Our annotators read and spend time understanding each note, then go across all the topics they feel are relevant to the conversation and label them.

Note that I'm using the term topic intentionally, not classes. When we train our classifiers they can be more pointed, and the notion of classes is derived from a collection of these topics interacting in a certain way. So we are designing our annotation experience closer to how domain experts think, not necessarily around a specific recipe or machine learning experiment, though there is a translation path to those. This helps us make the most of the annotator time without asking them to re-read the same note over and over. Hope that helps with the context.

Suppose we have a model that can make predictions, how will these predictions be used in practice? Will it attach tags to a message so that the right agent can be found for the issue? Are false positives worse than false negatives? What is the consequence of a good/bad prediction? Are the tags consistent over time? Or are they added/removed?

I'm mainly asking because understanding the business domain might help us design an annotation process and/or help us narrow the classes down.

If there are only 100 examples per week, yet 60 possible topics, the application might get unwieldy. But if we can narrow it down to the 5 most important topics, then we might still be able to provide value while we keep the complexity at bay.

You might enjoy reading this blogpost I wrote a while ago. It touches on a similar issue (adding tags to Github issues) and might offer some inspiration for your project. One of the conclusions there was that it can be fine to just focus on one tag in order to provide value but it also talks about tags that appear/disappear over time.

For now, as a starting point, we don't have models; we are purely using human-annotated data to drive narratives. I was hoping the UI experience of tagging in Prodigy would be a tad better than doing this in Excel, again just as a starting point for our annotators.
To answer your questions:

how will these predictions be used in practice? Will it attach tags to a message so that the right agent can be found for the issue?
It's slightly more subtle than that: the notes in each topic category are evaluated in light of business context (external to the note) and summarized into actionable takeaways that drive strategy. They form narratives, as collections and as interactions between collections.

Are false positives worse than false negatives?
Missing a key topic is worse than a false positive. It is similar to any research/search problem: you can never be sure that you are done and have covered all the needed topics.

What is the consequence of a good/bad prediction?
If they can be identified as bad predictions, we lose faith in the annotator; if they can't, we might miss out on a potential strategic lead.

Are the tags consistent over time? Or are they added/removed?
The tags change about 10% month over month. We add new tags and retire ones that no longer make sense.

One of the conclusions there was that it can be fine to just focus on one tag in order to provide value but it also talks about tags that appear/disappear over time.
That is a potential approach, but here it would mean 60 re-reads of the text of each note. If you look at the annotation workflow as a function of classification complexity and time to annotate x records, I have a hunch we might land midway: limiting the number of labels to something people can quickly handle while minimizing the re-read iterations.

Since the volume is so low, we are starting with "all data gets labeled and enriched". Some of these topics will then get a concrete class definition, with potential weak labels in terms of c-TF-IDF or something similar as the next step. While some classes will work with simpler approaches, others might need more complex transformer-based models. The more explainable a class definition is, the better; explainability is more important than model performance.
We are early in this particular space.
BTW, I love the simplicity of Prodigy, and I know I'm asking it to handle complexity yet appear simple. I'm aware of the inherent contradiction in my ask.