What do the green tick and red x do for choice tasks?

usage
custom
front-end

(C Swart) #1

Hi,

I was wondering what the green tick and red x do for choice tasks? After labelling should I always select the green tick or does the red x or skipping do anything? See picture attached:


(Ines Montani) #2

Yes, the accept button means “yes” and the reject button means “no”. The ignore button skips the question. Under the hood, the decision will be added as the "answer" key of your annotation task, e.g. "answer": "accept".

How you utilise the binary feedback depends on how you define your annotation task. For example, once you’re done selecting the options, you can accept the task. If the question contains errors (e.g. corrupted data) you can reject the task. You can also use the reject action to create negative examples for training – depending on what you’re trying to use the annotations for later on, this might also be valuable. Having the distinction between reject and ignore can also be helpful when working with external annotators – for example, they can ignore the task if they don’t understand the question, and reject it if they understand the question but want to send it back as “wrong”.
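
As a rough illustration of working with the "answer" key afterwards, here is a minimal sketch that groups annotated tasks by their decision. The example tasks are made up, and split_by_answer is a hypothetical helper name, not part of Prodigy.

```python
from collections import defaultdict


def split_by_answer(examples):
    """Group annotated tasks by the value of their "answer" key."""
    groups = defaultdict(list)
    for eg in examples:
        # tasks without an "answer" key are filed under "unknown"
        groups[eg.get("answer", "unknown")].append(eg)
    return dict(groups)


# made-up tasks, for illustration only
annotated = [
    {"text": "first example", "accept": ["FINANCIAL"], "answer": "accept"},
    {"text": "corrupted ###", "accept": [], "answer": "reject"},
    {"text": "unclear example", "accept": [], "answer": "ignore"},
]

groups = split_by_answer(annotated)
accepted = groups.get("accept", [])  # tasks confirmed with the green tick
```

Keeping rejected and ignored tasks in separate groups means you can decide later whether the rejected ones are useful as negative examples.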

In your case, you’re annotating in multiple choice mode, so clicking accept will confirm that you’re done annotating that task, and that your annotations are considered correct. If you’re using the single-choice mode, you can also set "choice_auto_accept": true in your prodigy.json or the 'config' option returned by your recipe to automatically accept the selected option. (This only makes sense for single choice, though – in multiple choice mode, you’ll still need a button to submit the task and confirm that you’re done.)
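
For the recipe route, the setting could be passed along roughly like this. This is only a sketch: choice_recipe_components is a hypothetical helper, the dataset name is a placeholder, and it assumes your recipe returns a dictionary of components with a 'config' entry as described above.

```python
def choice_recipe_components(dataset, stream):
    """Build the dictionary of components a custom choice recipe could return."""
    return {
        "dataset": dataset,    # dataset the annotations are saved to
        "stream": stream,      # iterable of annotation tasks
        "view_id": "choice",   # use the choice interface
        # auto-accept on selection; only sensible in single-choice mode
        "config": {"choice_auto_accept": True},
    }


# placeholder dataset name and an empty stream, for illustration only
components = choice_recipe_components("my_dataset", [])
```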


(C Swart) #3

When using choice mode, is there a built-in way to build a classifier on the choice categories? So is there a command I can run to see the classification results for the FINANCIAL, PHYSICAL, MENTAL and DISABILITY vulnerability categories in the above example?

I am also curious if there’s an easy way to add the seed option to my custom recipe?

Another question that came up while annotating today: if I start another annotation session with a different data source but the same dataset, can a document I have already annotated appear again in the next session?


(Ines Montani) #4

No, but that’s a nice idea! You can easily write your own little converter script for this, though:

from prodigy.components.db import connect

db = connect()  # connect to the database
examples = db.get_dataset('choice_dataset')  # get the dataset

textcat_examples = []  # collect reformatted examples here

for eg in examples:
    accepted = eg.get('accept', [])  # get the list of accepted IDs, e.g. ['FINANCIAL']
    for accepted_id in accepted:
        textcat_examples.append({'text': eg['text'], 'label': accepted_id})

You can then save out the textcat_examples to a JSONL file and add it to a dataset using db-in, or add it to your database straight away by creating a new dataset and adding the list of examples to it. You should then be able to use that dataset to train with textcat.batch-train.
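
Saving the list out as JSONL only needs the standard library. A minimal sketch, assuming one JSON object per line; write_jsonl is a hypothetical helper, and the file location and example records are made up:

```python
import json
import os
import tempfile


def write_jsonl(path, examples):
    """Write each example as one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


# made-up records standing in for the converted annotations
textcat_examples = [
    {"text": "made-up example", "label": "FINANCIAL"},
    {"text": "made-up example", "label": "MENTAL"},
]

# written to the temp directory here; use any path you like
path = os.path.join(tempfile.gettempdir(), "textcat_examples.jsonl")
write_jsonl(path, textcat_examples)
```

The resulting file can then be imported with db-in as described above.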

If you want to do this even more elegantly, you could also add an on_exit hook for your recipe that is run when you exit the Prodigy server, and automatically adds the reformatted tasks to a new dataset. The on_exit function takes the controller as its argument, which gives you access to the database and the already annotated examples of the current session. You can find an example of this in the custom recipes workflow.

def on_exit(ctrl):
    # get annotations of current session
    examples = ctrl.db.get_dataset(ctrl.session_id)
    # convert_examples is your own helper, e.g. the conversion loop above
    textcat_examples = convert_examples(examples)
    # add them to your other dataset (needs to exist in the database)
    # note: datasets must be a list or tuple, not a plain string
    ctrl.db.add_examples(textcat_examples, datasets=['textcat_examples'])
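
The convert_examples helper used in the hook is not a Prodigy built-in; you would write it yourself. One possible sketch, with the added assumption that only tasks answered with accept should be converted:

```python
def convert_examples(examples):
    """Turn choice annotations into one text/label pair per selected option."""
    converted = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # assumption: skip rejected and ignored tasks
        for label in eg.get("accept", []):
            converted.append({"text": eg["text"], "label": label})
    return converted


# made-up annotations, for illustration only
session_examples = [
    {"text": "doc one", "accept": ["FINANCIAL", "MENTAL"], "answer": "accept"},
    {"text": "doc two", "accept": ["PHYSICAL"], "answer": "reject"},
]
converted = convert_examples(session_examples)
```

A task with several selected options produces several text/label pairs, so the converted list can be longer than the input.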

As for the seed option, this depends on what exactly you’re trying to do – do you want to recreate the seed selection functionality of the textcat recipes in your custom choice recipe? You can see how the stream with seeds is composed in prodigy/recipes/textcat.py, or use the PatternMatcher from the NER recipes to find terms in your incoming stream. A stream of annotation examples is just a simple generator btw – so you can also implement your own, custom matching logic.
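
To make the "just a generator" point concrete, here is a minimal sketch of custom matching logic. It is not how the textcat recipes or the PatternMatcher work internally; filter_by_terms is a hypothetical helper that does naive whitespace tokenisation, and the stream is made up.

```python
def filter_by_terms(stream, terms):
    """Yield only the tasks whose text contains at least one of the terms."""
    terms = {t.lower() for t in terms}
    for eg in stream:
        # naive matching: lowercase and split on whitespace
        tokens = set(eg["text"].lower().split())
        if tokens & terms:
            yield eg


# made-up stream, for illustration only
stream = [
    {"text": "I lost my job and cannot pay rent"},
    {"text": "The weather was nice today"},
]
matched = list(filter_by_terms(stream, {"rent", "debt"}))
```

Because the function is a generator, it can wrap any incoming stream without loading everything into memory first.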

As for duplicates across sessions: by default, Prodigy tries to make as few assumptions about your streams as possible. Within the same session, duplicate tasks will be filtered out – but when you start a new session, Prodigy will not assume any state. However, once this bug is resolved in the upcoming release, you’ll be able to specify the --exclude argument or return a list of dataset IDs as the 'exclude' setting returned by your recipe. This will tell Prodigy to not ask you questions that were already annotated in that dataset. For example, you can set it to the current dataset name, or use the ID of your evaluation set to make sure that examples don’t appear in both your training and evaluation set.
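
The same idea can be approximated by hand in a custom stream. This is only a sketch under a strong assumption: two tasks count as the same if their "text" is identical, which is not necessarily how Prodigy compares tasks. exclude_seen is a hypothetical helper; in practice the annotated list could come from db.get_dataset as shown earlier in this thread.

```python
def exclude_seen(stream, annotated):
    """Yield tasks whose text does not already appear in the annotated examples."""
    seen = {eg["text"] for eg in annotated}
    for eg in stream:
        if eg["text"] not in seen:
            seen.add(eg["text"])  # also drops repeats within the stream itself
            yield eg


# made-up data, for illustration only
already_annotated = [{"text": "doc one", "answer": "accept"}]
incoming = [{"text": "doc one"}, {"text": "doc two"}, {"text": "doc two"}]
remaining = list(exclude_seen(incoming, already_annotated))
```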


(C Swart) #5

Thanks for the helpful reply, Ines. If I could make a suggestion: it would be great if the seed section could be factored out into a decorator pattern. Looking at the code, this is not straightforward unless the sorter is injected into the seeding decorator.

Copying the seeding section based on teach, I came across an issue with the lack of documentation around some built-in methods. For bits of code where we can’t check the code, could you add some form of documentation?

from prodigy.util import get_seeds, get_seeds_from_set, log

How does get_seeds work? It would just be more developer friendly if I didn’t need to guess it. Just having a method signature would help.


(Ines Montani) #6

Thanks, I definitely see your point in terms of making the recipes more readable and easier to follow.

The seed terms are nothing more than a simple set of strings. Since the built-in recipes allow passing in seeds as a comma-separated string, a file path or, in other cases, a dataset ID, we wrote some helper functions that take care of resolving the value of the seeds argument to a set of strings. For example, all get_seeds really does is take the seeds value the user passed in, check if it’s a file or a list of comma-separated terms and process it accordingly so you’ll always get out a set.

Our assumption has been that if you write your own recipes, you’ll be able to choose the input format more freely anyways, and process the incoming data however you want. So instead of having to learn our arbitrary API of helper functions, you can just write your own logic – e.g., open a file, read it in and return a list.
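
If you’d rather write that logic yourself, it could look roughly like this. To be clear, this is not the implementation of get_seeds, just a sketch of the behaviour described above; resolve_seeds is a hypothetical name, and the one-term-per-line file format is an assumption.

```python
import os


def resolve_seeds(value):
    """Resolve a file path or a comma-separated string to a set of terms."""
    if os.path.isfile(value):
        with open(value, encoding="utf8") as f:
            # assumed file format: one term per line, blank lines ignored
            return {line.strip() for line in f if line.strip()}
    # otherwise treat the value as comma-separated terms
    return {term.strip() for term in value.split(",") if term.strip()}


seeds = resolve_seeds("debt, loan,  overdraft")
```

Either way you get back a set of strings, so the rest of your recipe doesn’t need to care where the seeds came from.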

I’ll go through the recipes again and add more comments whenever internals / helpers like this are used. This way, it’ll at least be clear that seeds will be a set of strings etc. I’ll also make sure to add more detailed docstrings to our internal utils, so you’ll be able to call help() on them properly if you want to know what they’re doing (maybe this is weird, but I actually really enjoy writing docstrings, haha.)