Inverting annotations in dataset

obeavers · February 15, 2018, 6:49pm

Hello, did something silly.

I annotated the equivalent of “positive” comments, when in reality, I was looking for “negative”. Is there a way to invert the positive labels without doing a custom recipe? How would I transform the dataset?

Thx.

ines · February 15, 2018, 7:13pm

That should be no problem – you’ll have to write some code, but it’s pretty straightforward. So if I understand the question correctly, you want to flip the accept and reject decisions, right? If so, you could do something like this:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('name_of_your_dataset')
for eg in examples:
    answer = eg['answer']
    # flip accept <-> reject; other answers (e.g. 'ignore') stay unchanged
    if answer == 'accept':
        eg['answer'] = 'reject'
    elif answer == 'reject':
        eg['answer'] = 'accept'

# add the inverted examples to a new dataset
db.add_dataset('name_of_inverted_dataset')
db.add_examples(examples, datasets=['name_of_inverted_dataset'])

If you want to actually change the labels, you can also just overwrite eg['label'].
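To try the flipping logic without touching the database, here is a minimal sketch that works on plain dicts. The helper name `invert_answers`, the `new_label` parameter and the sample label values are made up for illustration; the only assumption is that each example is a dict with an 'answer' key (and optionally a 'label' key), like the ones in the snippet above.

```python
import copy

# accept and reject swap places; any other answer is left alone
FLIP = {'accept': 'reject', 'reject': 'accept'}


def invert_answers(examples, new_label=None):
    """Return copies of the examples with accept/reject swapped.

    If new_label is given, it also overwrites eg['label'] on every copy.
    The input list is not modified.
    """
    inverted = []
    for eg in examples:
        eg = copy.deepcopy(eg)
        eg['answer'] = FLIP.get(eg['answer'], eg['answer'])
        if new_label is not None:
            eg['label'] = new_label
        inverted.append(eg)
    return inverted


# hypothetical sample data, standing in for db.get_dataset(...)
sample = [
    {'text': 'great product', 'label': 'POSITIVE', 'answer': 'accept'},
    {'text': 'terrible service', 'label': 'POSITIVE', 'answer': 'reject'},
    {'text': '???', 'label': 'POSITIVE', 'answer': 'ignore'},
]
flipped = invert_answers(sample, new_label='NEGATIVE')
for eg in flipped:
    print(eg['label'], eg['answer'], eg['text'])
```

Working on copies keeps the original list intact, so you can compare before and after prior to writing anything to a new dataset.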

obeavers · February 15, 2018, 7:42pm

Aha! Okay, another silly question. Does the directory that I write this code in matter? I’m shelled into the virtual env that I’ve installed Prodigy into, but I haven’t actually found the files… maybe Windows is hiding them?

ines · February 15, 2018, 7:50pm

No, the directory doesn't matter: as long as you’re in the correct virtualenv, you can run the code from anywhere.

The Prodigy library should have been installed in your site-packages within the environment. You can find the full path by typing the following (or doing the equivalent from within the Python interpreter):

python -c "import prodigy; print(prodigy.__file__)"
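From within the Python interpreter, a rough equivalent could look like the sketch below. The helper name `package_location` is hypothetical, and 'json' is only used as a stand-in so the snippet runs even where Prodigy isn't installed; pass 'prodigy' to look up your own installation.

```python
import importlib.util


def package_location(name):
    """Return the file path a top-level package would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# replace 'json' with 'prodigy' inside your virtualenv
print(package_location('json'))
```

If this returns None for 'prodigy', the package most likely isn't installed in the environment that is currently active.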

Btw, speaking of directories: You can even have a prodigy.json or .prodigy.json in your current working directory to overwrite the global configuration. So you can create one directory per project, put all your source files and/or custom recipes in there and then use the local config for project-specific settings (and, if needed, share it with your teammates etc).
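As a rough illustration of the per-project setup, the sketch below writes a prodigy.json into a project folder. The helper `write_local_config` and the folder name are hypothetical, a temporary directory stands in for your real project directory, and the 'port' value is just an example setting; check the keys you want to override against the Prodigy configuration docs.

```python
import json
import tempfile
from pathlib import Path


def write_local_config(project_dir, settings):
    """Write the given settings to <project_dir>/prodigy.json and return the path."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    config_path = project_dir / 'prodigy.json'
    config_path.write_text(json.dumps(settings, indent=2), encoding='utf-8')
    return config_path


# a temporary directory stands in for the real project folder
with tempfile.TemporaryDirectory() as tmp:
    path = write_local_config(Path(tmp) / 'my_project', {'port': 8081})
    loaded = json.loads(path.read_text(encoding='utf-8'))
    print(path.name, loaded)
```

Running Prodigy from inside that folder would then pick up the local file on top of the global configuration, as described above.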

obeavers · February 15, 2018, 7:51pm

Awesome. Really useful stuff to know, thanks.

obeavers · February 15, 2018, 8:53pm

One more q: when I invert the examples, what happens to the scores in the metadata? Do I need to redo them?

And how to go about continuing training? When I call textcat.teach, do I call the same jsonl file I called originally? Trying to figure out if examples are automatically filtered out.

Thx!

ines · February 15, 2018, 9:07pm

The scores that are assigned to the examples are only relevant for the active learning component during annotation. They're added to the metadata so you can see them while annotating – but after that, they're not actually used anymore and only there for reference.

By default, Prodigy makes as few assumptions as possible about the incoming data. But you can use the --exclude argument to tell it to exclude annotations that are already present in the current (or any other) dataset(s). For example:

prodigy textcat.teach your_dataset en_core_web_sm your_data.jsonl --exclude your_dataset

(This is also useful for creating evaluation sets, to make sure none of your training examples accidentally end up in your evaluation data, and vice versa.)

obeavers · February 15, 2018, 9:52pm

Awesome thanks. If I’m in long-text mode and skip a comment because I can’t see the context, does that get included in the exclusion?

ines · February 15, 2018, 9:58pm

Yeah, the exclude filter only looks at the hashes and covers all annotations in the set, including ignored ones. If you want to be able to re-annotate ignored tasks, you can just remove them from your set, e.g. in the script you’re using to invert the answers.

examples = [eg for eg in examples if eg['answer'] != 'ignore']
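Before dropping anything, it can help to see how many tasks would come back into the queue. Here's a small sketch on plain dicts; the helper name `split_ignored` and the sample data are hypothetical, and it assumes every example has an 'answer' key.

```python
def split_ignored(examples):
    """Split examples into (kept, ignored) based on eg['answer']."""
    kept = [eg for eg in examples if eg['answer'] != 'ignore']
    ignored = [eg for eg in examples if eg['answer'] == 'ignore']
    return kept, ignored


# hypothetical sample data, standing in for db.get_dataset(...)
sample_examples = [
    {'text': 'first comment', 'answer': 'accept'},
    {'text': 'context missing', 'answer': 'ignore'},
    {'text': 'third comment', 'answer': 'reject'},
]
kept, ignored = split_ignored(sample_examples)
print(len(kept), 'kept,', len(ignored), 'to re-annotate')
```

Only the kept list would then go into the new dataset, so the ignored tasks are no longer filtered out by --exclude.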
