Inverting annotations in dataset

Hello, did something silly.

I annotated the equivalent of “positive” comments, when in reality, I was looking for “negative”. Is there a way to invert the positive labels without doing a custom recipe? How would I transform the dataset?


That should be no problem – you’ll have to write some code, but it’s pretty straightforward. So if I understand the question correctly, you want to flip the accept and reject decisions, right? If so, you could do something like this:

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset('name_of_your_dataset')
for eg in examples:
    answer = eg['answer']
    if answer == 'accept':
        eg['answer'] = 'reject'
    elif answer == 'reject':
        eg['answer'] = 'accept'

# add the inverted examples to a new dataset
db.add_examples(examples, datasets=['name_of_inverted_dataset'])

If you want to actually change the labels, you can also just overwrite eg['label'].
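For a binary text classification label, the same flipping pattern works on the label field. Here's a quick sketch — the label names POSITIVE/NEGATIVE are hypothetical placeholders, so substitute whatever labels your dataset actually uses:

```python
def invert_labels(examples, label_a='POSITIVE', label_b='NEGATIVE'):
    # Swap a binary label in place; examples without a 'label'
    # key are left untouched. Label names here are placeholders.
    for eg in examples:
        if eg.get('label') == label_a:
            eg['label'] = label_b
        elif eg.get('label') == label_b:
            eg['label'] = label_a
    return examples
```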


Aha! Okay, another silly question. Does the directory that I write this code in matter? I'm shelled into the virtual env that I've installed Prodigy into, but I haven't actually found the files… maybe Windows is hiding them?

Yes, as long as you’re in the correct virtualenv, you can run the code from anywhere.

The Prodigy library should have been installed in your site-packages within the environment. You can find the full path by typing the following (or doing the equivalent from within the Python interpreter):

python -c "import prodigy; print(prodigy.__file__)"

Btw, speaking of directories: You can even have a prodigy.json or .prodigy.json in your current working directory to overwrite the global configuration. So you can create one directory per project, put all your source files and/or custom recipes in there and then use the local config for project-specific settings (and, if needed, share it with your teammates etc).
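For instance, a project-local prodigy.json might look something like this (the settings shown are just illustrative — check the configuration section of the docs for the full list of available options):

```json
{
  "db": "sqlite",
  "port": 8080,
  "batch_size": 10
}
```

Any setting you put in the local file overrides the same setting in your global configuration for recipes run from that directory.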

Awesome. Really useful stuff to know, thanks.


One more q: when I invert the examples, what happens to the scores in the metadata? Do I need to redo them?

And how to go about continuing training? When I call textcat.teach, do I call the same jsonl file I called originally? Trying to figure out if examples are automatically filtered out.


The scores that are assigned to the examples are only relevant for the active learning component during annotation. They’re added to the metadata so you can see them while annotating – but after that, they’re not actually used anymore and only there for reference.

By default, Prodigy makes as few assumptions as possible about the incoming data. But you can use the --exclude argument to tell it to exclude annotations that are already present in the current (or any other) dataset(s). For example:

prodigy textcat.teach your_dataset en_core_web_sm your_data.jsonl --exclude your_dataset

(This is also useful for creating evaluation sets, to make sure none of your training examples accidentally end up in your evaluation data, and vice versa.)

Awesome, thanks. If I'm in long-text mode and skip a comment because I can't see the context, does that get included in the exclusion?

Yes – the exclude filter only looks at the hashes, and it covers all annotations in the set, including the ignored ones. If you want to be able to re-annotate ignored tasks, you can just remove them from your set, e.g. in the script you're using to invert the answers:

examples = [eg for eg in examples if eg['answer'] != 'ignore']
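Putting the two steps together, a small helper that drops the ignored tasks and flips the remaining answers might look like this (just a sketch — you'd call it on the examples before passing them to db.add_examples):

```python
def invert_and_filter(examples):
    # Drop ignored tasks so they can be re-annotated later,
    # then swap accept/reject on everything that's left.
    kept = [eg for eg in examples if eg.get('answer') != 'ignore']
    flip = {'accept': 'reject', 'reject': 'accept'}
    for eg in kept:
        if eg.get('answer') in flip:
            eg['answer'] = flip[eg['answer']]
    return kept
```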