textcat.teach with multiple choice interface?

I am annotating a corpus with 5 labels: bought, sold, buyer, seller, owns. Documents can have more than one label (sold and owns, for example).

When I use textcat.teach with exclusive left set to false, I get a pretty unbalanced set of annotations. Is it possible to use textcat.teach with the multiple-choice UI used by textcat.manual, so that I can label more than one category when they apply? For example, for something like "John sold a camera, and owns a wide angle lens", I want to annotate with sold and owns, but textcat.teach only asks me whether sold applies. I suppose at a later stage textcat.teach might ask me whether the owns label applies to the doc, but as I said, I am getting very unbalanced annotations: not many accepted 'owns' labels.

Thanks.

Hi! We don't currently have a built-in workflow for this, but you should be able to implement it with a custom recipe or by slightly modifying the textcat.teach recipe. Internally, Prodigy's annotation models for text classification should now be able to handle both the "text" + "label" format and the "options" + "accept" format. So before you send out the examples for annotation, you can add the "options" (all labels), pre-select the suggested label ("label") and then use the choice interface in the app.
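
For example, something along these lines (an untested sketch, just to illustrate the idea — add_options_to_stream is an illustrative helper name, not an existing Prodigy function):

def add_options_to_stream(stream, labels):
    # One multiple-choice option per label in your label set
    options = [{"id": label, "text": label} for label in labels]
    for eg in stream:
        eg["options"] = options
        # Pre-select the label the model suggested, if there is one
        if "label" in eg:
            eg["accept"] = [eg.pop("label")]
        yield eg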

We haven't tried out this exact workflow and it can sometimes make the active learning a little trickier if you're updating all labels at the same time. But if you try it out, I'd definitely be curious to hear how well it works :slightly_smiling_face:

Thanks, will hack around and report back.

@Superscope Did it work? Do you mind sharing the recipe?

@Superscope Hi, this is exactly what I need. Did you find any solution to this?

Hi @JanP and @gladiator, I gave up on this. TBH I find annotating one label at a time more productive. I then export the labels and merge them outside Prodigy, with the new dataset reflecting any overlapping labels. Apologies for not reporting back earlier.
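
In case it's useful, the merge step is roughly this (a simplified sketch — the file names are hypothetical, and it assumes each single-label dataset was exported to JSONL with db-out):

import json
from collections import defaultdict

# Collect the accepted label(s) for each text across the single-label exports
labels_by_text = defaultdict(set)
for path in ["bought.jsonl", "sold.jsonl", "owns.jsonl"]:  # hypothetical exports
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg.get("answer") == "accept":
                labels_by_text[eg["text"]].add(eg["label"])

# Write one merged example per text, with all accepted labels
with open("merged.jsonl", "w", encoding="utf8") as f:
    for text, labels in sorted(labels_by_text.items()):
        f.write(json.dumps({"text": text, "accept": sorted(labels)}) + "\n")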

+1 on this. A direct way to use textcat.teach with the multiple choice interface, where the annotator is correcting a trained model's predictions (instead of simply accepting or rejecting), would be incredibly useful for certain scenarios.

EDIT: The solution offered in "Is it possible to use model-in-the-loop with multi text classification using the 'choice' view_id?" is incomplete, but it provides a sketch of how one might accomplish this.

Here's my complete, working solution (as of Prodigy 1.10.4):

from typing import List, Optional, Dict, Any, Union, Iterable
from pathlib import Path
import spacy

from prodigy.models.matcher import PatternMatcher
from prodigy.models.textcat import TextClassifier
from prodigy.components.loaders import get_stream
from prodigy.components.sorters import prefer_uncertain
from prodigy.core import recipe
from prodigy.util import combine_models, log, msg, get_labels, split_string

@recipe(
    "textcat_multi",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model or blank:lang (e.g. blank:en)", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
    long_text=("DEPRECATED: Use long-text mode", "flag", "L", bool),
    init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the model. See 'spacy pretrain'.", "option", "t2v", str),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    # fmt: on
)
def textcat_multi(
    dataset: str,
    spacy_model: str,
    source: Union[str, Iterable[dict]],
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    init_tok2vec: Optional[Union[str, Path]] = None,
    loader: Optional[str] = None,
    long_text: bool = False,
    exclude: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log("RECIPE: Starting recipe textcat.teach", locals())
    if label is None:
        msg.fail("textcat.teach requires at least one --label", exits=1)
    if spacy_model.startswith("blank:"):
        nlp = spacy.blank(spacy_model.replace("blank:", ""))
    else:
        nlp = spacy.load(spacy_model)
    log(f"RECIPE: Creating TextClassifier with model {spacy_model}")
    model = TextClassifier(nlp, label, long_text=long_text, init_tok2vec=init_tok2vec)

    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(
            model.nlp,
            prior_correct=5.0,
            prior_incorrect=5.0,
            label_span=False,
            label_task=True,
            filter_labels=label,
            combine_matches=True,
            task_hash_keys=("label",),
        )
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the textcat model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    
    def stream_pre_annotated(stream):
        # Build the multiple-choice options from the labels passed on the
        # command line, instead of hard-coding them here.
        options = [{"id": lbl, "text": lbl} for lbl in label]
        for task in stream:
            eg = {"text": task["text"], "options": options}
            # Pre-select the model's suggested label if the score is high
            # enough; otherwise present the options unselected.
            if task["score"] >= 0.5:
                eg["accept"] = [task["label"]]
            yield eg

    stream = prefer_uncertain(predict(stream))
    stream = stream_pre_annotated(stream)

    return {
        "view_id": "choice",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "update": update,
        "config": {
            "lang": nlp.lang,
            "labels": model.labels,
            # Let annotators select more than one option per task
            "choice_style": "multiple",
        },
    }

And this can be run like so:

 prodigy textcat_multi your_dataset ./models/your_trained_textcat_model ./data/unlabeled_data.jsonl --label OPTION_1,OPTION_2,OPTION_3 -F ./src/multi_textcat_teach.py

There's surely some redundant code in here, but I don't have enough experience yet with the latest version of Prodigy's textcat recipe to trim it down. Hope it helps someone.

@trevorwelch Thanks so much for sharing your code! :+1: If you end up running some experiments, I'd definitely be interested in how it goes and what works best, how long it takes for the model to converge and produce more relevant suggestions etc. :slightly_smiling_face:

I was wondering: after annotating with your implementation of textcat_multi, is it possible to correct/teach "your_dataset" with patterns using textcat.teach? Is that going to affect the teaching from textcat_multi?
Also, since you have probably used this textcat recipe yourself, how many annotations did you actually do?
Oh yeah, and I can't thank you enough for sharing your code. I'm looking forward to using it for my project as well.