Getting access to annotations before they are placed in the DB

I am currently making a textcat validation set with many labels, and I need a fixed number of positive examples for each label. I would like to detect how many times the annotator has ‘accepted’ a proposed text for the label. Right now I have changed the config to:

'config': {
    'batch_size': 10
}

and then count the current label every time a new task is given in my filter_stream function.

There are a couple of problems with this. One, I can’t go back if I mislabel something, because every annotation is saved to the db right away. Two, it’s slow, and it gets slower the more annotations I have.

Ideally I could get the annotations before they are placed in the db so I don’t have to change the batch_size or grab the db to recount. Is there some way to do this?

Maybe you could implement this logic in your recipe’s update method? It’s returned as the 'update' component and is called whenever the server receives a new batch of annotations from the web app, before they are stored in the database. You could then have a global variable counting the labels (or a more elegant solution, depending on what you need). For example:

from collections import Counter

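label_counts = Counter()  # module-level counter, shared across batches

def update(examples):
    # No global/nonlocal declaration is needed here: the Counter is
    # mutated in place, never reassigned.
    for eg in examples:
        label = eg['label']
        label_counts[label] += 1
        # etc.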

One thing that’s important to keep in mind is that Prodigy won’t let you edit examples before they are stored. This is by design, to prevent the update method from accidentally modifying the annotations and messing up datasets this way. (And the records in the database should always reflect exactly what the annotator saw and worked on.)
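
To tie this back to the original question (a fixed number of accepted examples per label), here is a minimal sketch that counts only answers whose 'answer' value is 'accept' and lets the stream filter skip labels that are already full. TARGET_PER_LABEL and label_is_full are hypothetical names made up for illustration, not part of Prodigy, and the sketch assumes each task carries a top-level 'label' key, as in the example above.

```python
from collections import Counter

TARGET_PER_LABEL = 2  # hypothetical target; set this to whatever you need

accepted_counts = Counter()

def update(answers):
    # Count only the examples the annotator accepted.
    for eg in answers:
        if eg.get("answer") == "accept" and "label" in eg:
            accepted_counts[eg["label"]] += 1

def label_is_full(label):
    # Hypothetical helper: True once a label has enough accepted examples.
    return accepted_counts[label] >= TARGET_PER_LABEL

def filter_stream(stream):
    # Skip tasks whose label has already reached the target.
    for eg in stream:
        if not label_is_full(eg.get("label")):
            yield eg
```

Because the counts live in memory and are updated from each incoming batch, this avoids re-reading the dataset to recount. The counts start from zero on every server restart, so you would need to seed them from the existing dataset if you resume a session.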

Is there any way to exclude certain keys from the task dict before saving to the DB? I’m displaying base64-encoded images to the annotators, but I do not need to store those for later, which is what currently happens. (I have the image data stored separately already, and can match a key from the task dict to the image if required on my end.)

@AjinkyaZ Ah yes, I definitely see the point, and this is probably one of the few cases where this would make sense. The reason Prodigy doesn’t usually allow editing tasks before placing them in the DB is that the saved annotations should always match exactly what the annotator saw and what was used to render the task. Otherwise, it’s too easy to accidentally end up with corrupted and mismatched data, without a way to reproduce the annotations.

However, one thing you could do is implement the saving manually via the update method, which is called before the tasks are usually stored in the database. To disable the default storing, you can set 'db': False. In the update method, you could then overwrite the 'image' value, and then call db.add_examples to add the edited answers to your dataset. Something like this should work:

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import Images

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset, source):
    db = connect()  # uses the settings in your prodigy.json
    stream = Images(source)

    def update(answers):
        for eg in answers:
            eg['image'] = None # or something
        # add the answers to the dataset
        db.add_examples(answers, [dataset])

    return {
        'view_id': 'image',
        'dataset': dataset,
        'stream': stream,
        'db': False,
        'update': update
    }
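
If you would rather not modify the incoming answers in place, a small helper can build stripped copies instead. This is only a sketch: strip_keys is a hypothetical name, not a Prodigy function, and it removes the keys entirely rather than setting them to None.

```python
import copy

def strip_keys(answers, keys=("image",)):
    # Return deep copies of the answers with the given keys removed,
    # leaving the original dicts untouched.
    stripped = []
    for eg in answers:
        eg = copy.deepcopy(eg)
        for key in keys:
            eg.pop(key, None)  # ignore keys that are not present
        stripped.append(eg)
    return stripped
```

Inside update, you would then pass strip_keys(answers) to db.add_examples instead of the raw answers.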

Yep, I get why modification might become an issue. I tried what you suggested, but it gives an error on the 'db': False part. This is the stacktrace:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prodigy/pgy-env/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 65, in prodigy.core.Controller.__init__
  File "/home/prodigy/pgy-env/lib/python3.6/site-packages/prodigy/recipes/generic.py", line 33, in fill_memory
    examples = ctrl.db.get_dataset(dataset)
AttributeError: 'bool' object has no attribute 'get_dataset'

Code:

@prodigy.recipe('custom-mark',
    dataset=('Dataset ID', 'positional', None, str),
    view_id=('Annotation interface', 'option', 'v', str),
    memorize=('Enable answer cache', 'flag', 'M', bool),
    port=('Port to run application on', 'option', 'p', str),
    exclude=('Exclude data from dataset', 'option', 'e', str))
def my_custom_recipe(dataset, view_id='choice', source=None, memorize=False, port=8080, exclude=None):
    db = connect()    
    with open('template_page.html') as tmp:
        html_template = tmp.read()

    stream = add_options(JSONL(source), html_template)
    stream = fetch_images(stream)
    
    def update(answers):
        for i in answers:
            i['image'] = None
        logging.debug(answers)
        # this function is triggered when Prodigy receives annotations
        print("Received {} annotations!".format(len(answers)))
        db.add_examples(answers, [dataset])
    
    config = {'choice_auto_accept': True,
            'html_template': html_template,
            'instructions': './instructions_page.html',
            'custom_theme': {
                'cardMaxWidth': '1000px',
                },
            'port': port,
            'host': '0.0.0.0'
            }
    components = mark(dataset=dataset, source=stream, memorize=True, exclude=[dataset])
    components['view_id'] = view_id
    components['config'] = config
    components['update'] = update
    components['db'] = False
    return components

From what I understand, leaving the db param as it is would prevent the add_examples method from the custom update function from triggering, right?

Ah, I think the problem here occurs because you’re wrapping a built-in recipe that calls into the database. It also means that you’re overwriting the mark recipe’s update method, which is used to update the memory (when used with memorize=True).

So I think at this point, it’s probably easiest to just write your own adaptation of the mark recipe. If you look at generic.py, you’ll see that the recipe itself is actually pretty straightforward. Instead of calling ctrl.db.get_dataset, you can then connect to the database directly at the top of the function. And in the recv_answers method, you can add the logic that saves the examples to the database manually.

I’ll try that out, thanks!
EDIT: Ended up switching to loading the images from S3 as sending encoded images was slower. The above approach worked, thanks.

Hi,
my intention is to save annotations into my custom DB while the Prodigy annotations are being saved.
I copied the "textcat" recipe from /prodigy/recipes/textcat.py and created a new custom one. I added two print statements (to console and log) in the "update()" function, but I do not see the output when I hit the save button.

My update function:

def update(answers):
    print("yuri_answers:", answers)
    log("yuri_answers:", answers)

Does it mean that the "update()" function is not called?

Entire recipe:

# coding: utf8
from __future__ import unicode_literals, print_function

import spacy
# import random
# import tqdm
# import copy
# from spacy.util import minibatch, fix_random_seed
#
# from ..models.matcher import PatternMatcher
# from ..models.textcat import TextClassifier
# from ..components import printers
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import split_sentences, add_label_options
from prodigy.components.preprocess import add_labels_to_stream
# from ..components.preprocess import convert_options_to_cats
# from ..components.db import connect
# from ..components.sorters import prefer_uncertain
from prodigy.core import recipe, recipe_args
# from ..util import export_model_data, split_evals, get_print
from prodigy.util import combine_models, prints, log, set_hashes

@recipe(
    "textcat_manual_custom",
    dataset=recipe_args["dataset"],
    spacy_model=recipe_args["spacy_model"],
    source=recipe_args["source"],
    api=recipe_args["api"],
    loader=recipe_args["loader"],
    label=recipe_args["label_set"],
    exclusive=recipe_args["exclusive"],
    exclude=recipe_args["exclude"],
)
def manual(
    dataset,
    spacy_model,
    source=None,
    api=None,
    loader=None,
    label=None,
    exclusive=False,
    exclude=None,
):
    print("textcat_manual_custom()")
    """
    Manually annotate categories that apply to a text. If more than one label
    is specified, categories are added as multiple choice options. If the
    --exclusive flag is set, categories become mutually exclusive, meaning that
    only one can be selected during annotation.
    """
    log("RECIPE: Starting recipe textcat_manual_custom", locals())
    nlp = spacy.load(spacy_model)
    log("RECIPE: Loaded model {}".format(spacy_model))
    labels = label
    has_options = len(labels) > 1
    log("RECIPE: Annotating with {} labels".format(len(labels)), labels)
    stream = get_stream(
        source, api=api, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    if has_options:
        stream = add_label_options(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)

    def update(answers):
        print("yuri_answers:", answers)
        log("yuri_answers:", answers)

    return {
        "view_id": "choice" if has_options else "classification",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "config": {
            "lang": nlp.lang,
            "labels": labels,
            "choice_style": "single" if exclusive else "multiple",
        },
    }

Here is how I initiate Prodigy server:

prodigy textcat_manual_custom dataset_1 en_core_web_sm input.jsonl --label "L1","L2" -F "/full/path/to/textcat_manual_custom.py"

Thank you!

I think the update callback isn't called because your recipe is not returning it. Try adding "update": update to the dict returned by your recipe function.
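
As a rough sketch of what that looks like (make_components is a hypothetical stand-in for the body of your recipe function, with the decorator and stream setup left out):

```python
def make_components(dataset, stream):
    # Hypothetical stand-in for the body of a recipe function.

    def update(answers):
        # Runs whenever the server receives a batch of answers.
        print("Received {} annotations".format(len(answers)))

    return {
        "view_id": "classification",
        "dataset": dataset,
        "stream": stream,
        # Defining update() is not enough: it has to be returned
        # under the "update" key, otherwise it is never called.
        "update": update,
    }
```

In your recipe, that means adding the "update": update entry to the dict you return at the end of manual().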