Does Prodigy support hierarchical annotation?

I would like to use Prodigy for hierarchical annotation. For example, if an example qualifies for annotation A, then annotate it and move on to the next example; otherwise, if it qualifies for B, then check whether it's also C, D, or E.

Or is there a way to customize Prodigy for this kind of task?


Prodigy itself is pretty agnostic to what your annotations “mean” so you can definitely build a workflow like this. The only thing that’s kinda built-in is a strong focus on single decisions at a time and automating as much as possible.

One approach could be to use the choice interface and start by annotating the top-level buckets like A, B and C, without worrying about the lower-level categories. See here for example recipe code. In the next step, you can then stream in the examples again and add different options, based on the top-level category that was selected – for example, A1 and A2 for A and so on. Prodigy streams are regular Python generators, so you can automate all of this logic by putting it in a function that yields annotation examples. For example, something like this:

hierarchy = {'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}

def get_stream(examples):
    for eg in examples:   # the examples with top-level categories
        top_labels = eg['accept']  # ['A'] or ['B', 'C'] if multiple choice
        for label in top_labels:
            sub_labels = hierarchy[label]
            options = [{'id': opt, 'text': opt} for opt in sub_labels]
            # create new example with text and sub labels as options
            new_eg = {'text': eg['text'], 'options': options}
            yield new_eg
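
To see what such a generator produces, here is a small self-contained sketch that runs it on two in-memory examples. The sample texts are made up, and the "meta" entry recording the parent label is an optional addition that is not part of the snippet above.

```python
hierarchy = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}


def get_stream(examples):
    """Yield one second-level choice task per selected top-level label."""
    for eg in examples:
        for label in eg.get("accept", []):
            options = [{"id": opt, "text": opt} for opt in hierarchy[label]]
            # "meta" is an optional extra that keeps track of the parent label
            yield {"text": eg["text"], "options": options, "meta": {"top_level": label}}


# Made-up first-pass annotations, shaped like choice-interface output
annotated = [
    {"text": "first text", "accept": ["A"]},
    {"text": "second text", "accept": ["B", "C"]},
]

tasks = list(get_stream(annotated))
for task in tasks:
    print(task["text"], [opt["id"] for opt in task["options"]])
```

An example with two selected top-level labels produces two follow-up questions, one per label.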

Doing the levels in separate steps also allows you to iterate faster if you end up having to adjust the annotation scheme. Not all schemes are set in stone and if your annotators struggle with a top-level decision like B vs. C, they’ll likely also struggle with the lower-level decisions. So ideally, you want to find out about this as early as possible and before you commission the full fine-grained annotations on your entire corpus.

Hey Ines, @ines @MatthewC

I am trying to set up a hierarchical classification for classes and sub-classes. I did the first part of the method mentioned above to do the high level classification. I do not understand how the next step to provide next level hierarchy fits in with the first step.

Is the original data set required to be passed here? I am not sure what you mean by examples with top-level categories.

It would be really helpful if you could explain how the recipe for the hierarchy ties both steps together.

I am getting the following error when I try to write a single recipe for the entire process.

I am trying to classify first as fluid and mechanical, and later as f1, f2 and m1, m2.
This is the snippet I used: (image)

Yes, those are supposed to be the examples you've previously annotated with the top-level categories (e.g. using a recipe like textcat.manual or any other custom recipe with the choice interface).

The eg["accept"] is referring to the "accept" key of the dictionary here, so you shouldn't modify that one. Its value is a list of labels that were selected in the UI. This is the format produced by the choice recipe – see here for an example:
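As a rough illustration, a record saved by the choice interface could look like the one parsed below. The text and option IDs are made up, and real records carry additional fields (such as hashes) that are left out here.

```python
import json

# Illustrative record only; the text and option IDs are invented
line = (
    '{"text": "The pump is leaking oil", '
    '"options": [{"id": "fluid", "text": "fluid"}, '
    '{"id": "mechanical", "text": "mechanical"}], '
    '"accept": ["fluid"], "answer": "accept"}'
)

eg = json.loads(line)
top_labels = eg["accept"]  # list of the option IDs selected in the UI
print(top_labels)
```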

So the top labels are coming from the data you previously annotated with those labels. And then for each of those examples, you create a new task with the lower-level labels as options. Also see here for a visual example of the concept:

This is the custom recipe I am using which will clean and annotate the data at the same time.

After selecting the options in the UI and annotating, I try converting it to a JSON file using the to-patterns command: python -m prodigy examples_eg hp.jsonl --label fluid,mechanical --spacy-model blank:en

This is the file I get. It is supposed to be either fluid or mechanical and not both: (image)

I think this is the reason I am not able to set the hierarchy in the next step. Any thoughts on this?

My recipe for the hierarchy is as follows: (image)

I am getting the same error as I mentioned in my previous comment.

I might be missing something here.

Why are you converting to match patterns here? I don't think that's what you want to do – I think you just want to export the annotated examples? You can use the db-out command, or even load the annotations from the database programmatically in your recipe.

This means that eg is a string. I think you're missing the step that actually loads the examples? So whatever you pass in as examples (like the path to a file) is passed through here. So it's trying to access the index ['accept'] of a string like /path/to/something, which isn't going to work.
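The difference can be reproduced in plain Python. The sketch below first iterates over a path string directly, then over parsed JSONL lines. The path is a placeholder and the record is made up.

```python
import json

source = "/path/to/something"  # placeholder path, never opened here

# Iterating over a string yields single characters, so eg is a str
try:
    for eg in source:
        eg["accept"]
except TypeError as err:
    error_message = str(err)
print(error_message)

# Parsing each JSONL line first gives dictionaries instead
jsonl_lines = ['{"text": "The valve is stuck", "accept": ["mechanical"]}']
examples = [json.loads(line) for line in jsonl_lines]
print(examples[0]["accept"])
```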

@ines Thank you so much for responding. I really appreciate it.

It makes sense now. Thank you.

This is where I am having trouble catching up.

I think I have got everything right up to the part where I classify at a high level into "fluid" and "mechanical", using a custom recipe with the choice interface. But later classifying fluid as f1 & f2 and mechanical as m1 & m2 is where I am having trouble.

I am not quite sure how to load the data to set up the hierarchy. Do you think I should write a generator function that takes the examples line by line and indexes ['accept']?
I apologize for taking up your time, but I am trying to understand the underlying workflow here to set up a custom recipe for the hierarchy. It would be really helpful if you could direct me towards an example of such a workflow setup.

Thanks in advance.

So the workflow I was proposing would look something like this:

  1. Annotate some examples with the top-level categories, e.g. "fluid" and "mechanical", using textcat.manual.
  2. Load the data created in the first step in your recipe and create new questions for each example. For instance, for every example you've annotated with "fluid", create a new question that now has the options "f1" and "f2".
  3. Annotate again and you have a dataset with all top-level categories and lower-level categories.

So if you've saved your annotations from step 1 in a dataset called textcat_top_level, you can run prodigy db-out textcat_top_level ./output to save a file textcat_top_level.jsonl to the directory output. You can then use that as the input in your recipe and use the JSONL loader to load the examples.
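The steps above can be sketched with the standard library alone. The code below writes a stand-in for the exported file, reads it back line by line, and builds the second-level questions. The sub-label mapping, the sample records, the helper names, and the "meta" entry are all assumptions for illustration. Skipping examples whose "answer" is not "accept" is also a choice made for this sketch.

```python
import json
import os
import tempfile

# Sub-label names follow the thread; the mapping itself is an assumption
HIERARCHY = {"fluid": ["f1", "f2"], "mechanical": ["m1", "m2"]}


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def second_level_stream(examples):
    """Create a choice task with sub-labels for each accepted top-level label."""
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for label in eg.get("accept", []):
            options = [{"id": sub, "text": sub} for sub in HIERARCHY.get(label, [])]
            if options:
                yield {"text": eg["text"], "options": options, "meta": {"top_level": label}}


# Made-up records standing in for the file exported in step 1
rows = [
    {"text": "Oil is leaking from the pump", "accept": ["fluid"], "answer": "accept"},
    {"text": "The bearing is worn", "accept": ["mechanical"], "answer": "accept"},
    {"text": "Unrelated note", "accept": [], "answer": "ignore"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "textcat_top_level.jsonl")
    with open(path, "w", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    tasks = list(second_level_stream(read_jsonl(path)))

for task in tasks:
    print(task["meta"]["top_level"], [opt["id"] for opt in task["options"]])
```

In a custom recipe, a generator like second_level_stream would be what the recipe returns as its stream, together with the choice interface as the view. The file reading could be replaced by the JSONL loader mentioned above.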

(For some background on custom recipes, you might also find my video here useful. It's a pretty different topic but I'm also trying to explain the overall concept of recipe scripts and how the pieces fit together.)

Thank you very much for your responses. It does clear a lot of questions for me.
