How do you correct labels when you already have a prodigy.db full of labelled data?

I read about textcat.correct, but I'm not sure it's what I'm after. Essentially, I want a process that runs a quick version of the model against the live database and then lets me relabel/correct items that someone may have labelled incorrectly. My current workaround is quite long: extract the data from the database, run a BERT model, take the incorrect test predictions, and push them back into Prodigy after removing them from the database. This is long-winded, and I was hoping there's a way to just launch Prodigy to relabel the items in the current database where the model thinks the existing label may be incorrect?

hi @bev.manz!

You can create a custom recipe that slightly tweaks textcat.correct so it only serves incorrect predictions (assuming you have existing labels you can compare the model's predictions against).

You can start from the built-in recipe's script. Run python -m prodigy stats and note the Location: path, then open recipes/textcat.py inside that folder, where you'll find the textcat.correct recipe.
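For example (the location path below is just a placeholder for whatever python -m prodigy stats prints on your machine):

python -m prodigy stats                                # note the "Location:" path in the output
cp <location>/recipes/textcat.py my_recipe.py          # copy it as the starting point for your custom recipe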

The key part is this function, which adds the model's suggestions (predictions) to each task; this is where you'd add logic so that only the examples that look incorrect get yielded:

import copy

# Add the classifier's predictions to each task in the stream: the "options" key
# gets every label with its score, and the "accept" key gets the categories
# scoring above the threshold.
def add_suggestions(stream):
    texts = ((eg["text"], eg) for eg in stream)
    # Process the stream using spaCy's nlp.pipe, which yields doc objects.
    # If as_tuples=True is set, you can pass in (text, context) tuples.
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
        task = copy.deepcopy(eg)
        options = []
        selected = []
        for cat, score in doc.cats.items():
            if cat in labels:
                options.append({"id": cat, "text": cat, "meta": f"{score:.2f}"})

                # this threshold check is the auto-select logic you'd change
                # so that only incorrect predictions are served (see below)
                if score >= threshold:
                    selected.append(cat)
        task["options"] = options
        task["accept"] = selected
        yield task

The key is to change the logic that builds selected (i.e., the categories that get pre-accepted). Instead of pre-selecting everything above the threshold, you can compare the model's prediction against the label already stored on each example and only yield the examples where they disagree, as in the sketch below.
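As a minimal sketch (assuming your existing annotations store the accepted categories under the "accept" key, as the choice interface does, and that nlp, labels and threshold come from the enclosing recipe function just like in textcat.correct), the modified function could look roughly like this:

import copy

def add_suggestions(stream):
    texts = ((eg["text"], eg) for eg in stream)
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
        task = copy.deepcopy(eg)
        options = []
        predicted = []
        for cat, score in doc.cats.items():
            if cat in labels:
                options.append({"id": cat, "text": cat, "meta": f"{score:.2f}"})
                if score >= threshold:
                    predicted.append(cat)
        # Labels previously stored on the example in your dataset; adjust the key
        # if your annotations use a different format (e.g. a single "label" field)
        existing = set(eg.get("accept", []))
        # Only serve the examples where the model disagrees with the stored labels
        if set(predicted) != existing:
            task["options"] = options
            # Pre-select the stored labels so the annotator only has to confirm or fix
            # them; use `predicted` here instead if you'd rather review the model's guess
            task["accept"] = sorted(existing)
            yield task

Everything else in the copied textcat.correct script (loading the stream, the choice interface config, etc.) can stay as it is.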

To use your current dataset as the recipe's source (let's call it my_labeled_dataset), add dataset: as a prefix, i.e. dataset:my_labeled_dataset. See the docs for more details.

You may also find the docs on loading custom recipes (e.g., -F my_recipe.py) and customizing their arguments helpful.
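Putting it together, and assuming your custom recipe keeps the same argument order as textcat.correct and is registered under a placeholder name like textcat.correct-mistakes (the dataset names, model and labels below are placeholders too), the command would look roughly like:

python -m prodigy textcat.correct-mistakes my_corrections en_core_web_sm dataset:my_labeled_dataset --label LABEL_A,LABEL_B -F my_recipe.py

The corrected annotations get saved into the new my_corrections dataset, so your original labelled dataset isn't modified.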

Hope this helps and let us know if you have further questions!