Adding a text box to a recipe

I am trying to add a simple text box to a recipe, specifically the textcat.teach recipe, because I want to annotate/classify some text samples while keeping the model in the loop.

I copied the recipe from here, added a blocks variable to the config, and also passed the pipe_name argument to the TextClassifier model (as shown here):

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.models.textcat import TextClassifier
from prodigy.models.matcher import PatternMatcher
from prodigy.components.sorters import prefer_uncertain
from prodigy.util import combine_models, split_string
import spacy
from typing import List, Optional


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "textcat.teach.BOX",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    patterns=("Optional match patterns", "option", "p", str),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)
def textcat_teach(
    dataset: str,
    spacy_model: str,
    source: str,
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    exclude: Optional[List[str]] = None,
):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    blocks = [
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]
    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)

    # Load the spaCy model
    nlp = spacy.load(spacy_model)

    # Initialize Prodigy's text classifier model, which outputs
    # (score, example) tuples
    model = TextClassifier(nlp, label, pipe_name="textcat")

    if patterns is None:
        # No patterns are used, so just use the model to suggest examples
        # and only use the model's update method as the update callback
        predict = model
        update = model.update
    else:
        # Initialize the pattern matcher and load in the JSONL patterns.
        # Set the matcher to not label the highlighted spans, only the text.
        matcher = PatternMatcher(
            nlp,
            prior_correct=5.0,
            prior_incorrect=5.0,
            label_span=False,
            label_task=True,
        )
        matcher = matcher.from_disk(patterns)
        # Combine the text classifier and the matcher, interleave their
        # suggestions and update both at the same time
        predict, update = combine_models(model, matcher)

    # Use the prefer_uncertain sorter to focus on suggestions that the model
    # is most uncertain about (i.e. with a score closest to 0.5). The model
    # yields (score, example) tuples and the sorter yields just the example
    stream = prefer_uncertain(predict(stream))
    
    return {
        "view_id": "classification",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "update": update,  # Update callback, called with batch of answers
        "exclude": exclude,  # List of dataset names to exclude
        "config": {"lang": nlp.lang, "blocks": blocks},  # Additional config settings, mostly for app UI
    }

but when I try to run:

python -m prodigy textcat.teach.BOX news_groups blank:en newsgroups_space.txt --label NODULE --patterns nodule_patterns.jsonl -F text_cat_with_box.py

I get:

  File "text_cat_with_box.py", line 48, in textcat_teach
    model = TextClassifier(nlp, label, pipe_name="textcat")
  File "cython_src\prodigy\models\textcat.pyx", line 90, in prodigy.models.textcat.TextClassifier.__init__
  File "cython_src\prodigy\models\textcat.pyx", line 23, in prodigy.models.textcat.infer_exclusive
ValueError: Can't infer exclusive vs. non-exclusive categories from 'textcat': not in the pipeline. Available: []

How would I add a simple text box to this recipe where the annotator can give a reason for their choice? I also tried pasting the original recipe code directly and running it, and it gives the same error. Any ideas what could be happening here?

Hi! It looks like the problem here is that you're using a blank:en pipeline with no text classifier, so there's nothing that the recipe can use to predict the initial categories. One thing you can do in your recipe is to make sure a text classifier is added and has the correct labels:

from prodigy.models.textcat import add_text_classifier

# in your recipe
add_text_classifier(nlp, label)

That said, this will start you off with a blank text classifier: the model will know essentially nothing, and it might take you a lot longer to get to a state where it can make useful suggestions. So if possible, you ideally want to start off with a text classifier that was trained on at least a small sample of manually annotated data. For example, you could run textcat.manual, collect a few representative examples, train your model with prodigy train and then use that in your custom textcat.teach workflow to improve it further.
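
For anyone hitting the same ValueError, here is a minimal sketch of a check that could run at the top of a custom recipe before the classifier is created. The helper name missing_pipes is hypothetical (it is not part of Prodigy or spaCy), and the two pipeline objects are simple stand-ins so the snippet runs without either library installed. It relies only on pipe_names, the attribute spaCy uses to list a pipeline's components.

```python
from types import SimpleNamespace


def missing_pipes(nlp, required=("textcat",)):
    """Return the required component names that are absent from nlp.pipe_names."""
    return [name for name in required if name not in nlp.pipe_names]


# Stand-in for a blank:en pipeline, which has no components at all.
blank_nlp = SimpleNamespace(pipe_names=[])
# Stand-in for a pipeline that was trained with a text classifier.
trained_nlp = SimpleNamespace(pipe_names=["tok2vec", "textcat"])

print(missing_pipes(blank_nlp))    # ['textcat']
print(missing_pipes(trained_nlp))  # []
```

In a real recipe, a non-empty result would be the point at which to either add the text classifier as described above or stop with a clearer message.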

Hi Ines,

Thank you for the insight. I think there are discrepancies between the built-in recipes and the recipes that exist on Explosion's GitHub? We can see that this textcat.manual recipe is different: it doesn't have the --loader parameter. I really just want to slightly modify the built-in recipe so that I can add one extra block (a text box), but I can't seem to find the built-in recipes.

Thank you

OKAY, I figured out how to accomplish this!
First I had to check out the built-in recipe, which can be located by running python -c "import prodigy;print(prodigy.__file__)" and can be found here.
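
The same lookup can be sketched in Python. The helper name package_dir is hypothetical, and the example uses the standard library's json package so that it runs on any machine. The "recipes" subfolder mentioned in the final comment is an assumption about Prodigy's layout, not something confirmed in this thread.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory an installed top-level package lives in."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"Package {name!r} is not installed")
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent


# Demonstrated with the standard library's json package so it runs anywhere;
# swap in "prodigy" on a machine where Prodigy is installed.
print(package_dir("json"))

# Assumption: the built-in recipes sit in a "recipes" subfolder, so textcat.py
# would be at package_dir("prodigy") / "recipes" / "textcat.py".
```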

I took the textcat.manual part of the textcat.py file and modified it to:

  1. add two components to the blocks list (the actual text to annotate and the input box)
  2. include the blocks in the config
  3. set "blocks" as the view_id

Here is what my code looks like:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.models.textcat import TextClassifier
from prodigy.models.matcher import PatternMatcher
from prodigy.components.sorters import prefer_uncertain
from prodigy.util import combine_models, split_string, get_labels, log
from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import add_label_options, add_labels_to_stream
from prodigy.components.db import connect  # needed for connect() below
from prodigy.types import TaskType, StreamType, RecipeSettingsType
from typing import List, Optional, Union, Iterable
from wasabi import msg  # needed for msg.fail() below
import spacy
# NOTE: filter_accepted_inputs (used below) also needs an import; its module is
# not shown in this thread, so copy that import line from the built-in textcat.py.
@prodigy.recipe(
    "textcat.manual.BOX",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    exclusive=("Treat classes as mutually exclusive (if not set, an example can have multiple correct classes)", "flag", "E", bool),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    # fmt: on
)
def manual(
    dataset: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = None,
    label: Optional[List[str]] = None,
    exclusive: bool = False,
    exclude: Optional[List[str]] = None,
) -> RecipeSettingsType:
    """
    Manually annotate categories that apply to a text. If more than one label
    is specified, categories are added as multiple choice options. If the
    --exclusive flag is set, categories become mutually exclusive, meaning that
    only one can be selected during annotation.
    """
    
    log("RECIPE: Starting recipe textcat.manual", locals())
    labels = label
    if not labels:
        msg.fail("textcat.manual requires at least one --label", exits=1)
    has_options = len(labels) > 1
    log(f"RECIPE: Annotating with {len(labels)} labels", labels)
    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    blocks = [
        {"view_id": "choice" if has_options else "classification"},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]
    if has_options:
        stream = add_label_options(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)
        if exclusive:
            # Use the dataset to decide what's left to annotate
            db = connect()
            if dataset in db:
                stream = filter_accepted_inputs(db.get_dataset(dataset), stream)

    return {
        #"view_id": "choice" if has_options else "classification",
        "view_id": "blocks", 
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "config": {
            "labels": labels,
            "choice_style": "single" if exclusive else "multiple",
            "choice_auto_accept": exclusive,
            "exclude_by": "input" if has_options else "task",
            "auto_count_stream": True,
            "blocks": blocks,
        },
    }
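
The three changes above can also be sketched as a small helper that wraps the components dictionary a recipe returns, which would apply equally to the textcat.teach recipe from the original question. The function name with_text_box is hypothetical and not part of Prodigy, and the dictionary below is only a stand-in for a real recipe's return value. The block keys (view_id, field_rows, field_label) are the ones used in the recipe above.

```python
def with_text_box(components, field_label="Explain your decision", field_rows=3):
    """Return a copy of a recipe's components that shows the original
    interface plus a free-text box, using the "blocks" interface."""
    blocks = [
        {"view_id": components["view_id"]},  # the recipe's original interface
        {"view_id": "text_input", "field_rows": field_rows, "field_label": field_label},
    ]
    # Keep existing config settings and add the blocks alongside them.
    config = {**components.get("config", {}), "blocks": blocks}
    return {**components, "view_id": "blocks", "config": config}


# Stand-in for what a textcat.teach-style recipe returns (stream shortened).
teach_components = {
    "view_id": "classification",
    "dataset": "news_groups",
    "stream": iter([{"text": "example", "label": "NODULE"}]),
    "config": {"lang": "en"},
}

wrapped = with_text_box(teach_components)
print(wrapped["view_id"])                                   # blocks
print([block["view_id"] for block in wrapped["config"]["blocks"]])
# ['classification', 'text_input']
```

A recipe could then end with return with_text_box({...}) instead of repeating the blocks setup in every custom recipe. The original dictionary is left unchanged.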

Glad you got it working!

Yes, the versions of the recipes in the prodigy_recipes repo are slightly modified and simplified so they work better as templates to start from and modify, and contain less "magic" than the built-in recipes, which need to deal with all kinds of input etc.

Thanks again!