Highlight a list of terms in `textcat.manual` for binary annotation


I am annotating text for a binary classification task, and I would like a set of words that I specify to appear highlighted in the Prodigy session to make annotation easier. I am using Prodigy v1.8.5. I tried modifying the textcat.manual recipe from GitHub to use patterns, but after many hours of trial and error I still get a "No tasks available" message in the Prodigy session. Below is the code for the textcat.manual recipe along with the code I added (marked with # ADDED).

from typing import List, Optional
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
from prodigy.models.matcher import PatternMatcher # ADDED
import spacy # ADDED: v.2.1.9

# Helper functions for adding user provided labels to annotation tasks.
def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        yield task

def add_labels_to_stream(stream, labels):
    for task in stream:
        task["label"] = labels[0] # ADDED: original code has `label[0]` instead of `labels[0]`, which is likely an error
        yield task

# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "textcat.manual_pattern", # ADDED: original was "textcat.manual"
    dataset=("The dataset to use", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    exclusive=("Treat classes as mutually exclusive", "flag", "E", bool),
    exclude=("Names of datasets to exclude", "option", "e", split_string),
)

# ADDED: original was `def textcat_manual(`
def textcat_manual_pattern(
    dataset: str,
    source: str,
    label: Optional[List[str]] = None,
    exclusive: bool = False,
    exclude: Optional[List[str]] = None,
):
    """
    Manually annotate categories that apply to a text. If more than one label
    is specified, categories are added as multiple choice options. If the
    --exclusive flag is set, categories become mutually exclusive, meaning that
    only one can be selected during annotation.
    """

    # Load the stream from a JSONL file and return a generator that yields a
    # dictionary for each example in the data.
    stream = JSONL(source)

    # ADDED: Adding patterns
    nlp = spacy.blank("en") # ADDED
    matcher = PatternMatcher(nlp, label_span=True) # ADDED
    patterns = [{"label" : "LABEL1", "pattern" : "pattern1"}] # ADDED
    matcher.add_patterns(patterns) # ADDED
    stream = matcher(stream) # ADDED

    # Add labels to each task in stream
    has_options = len(label) > 1
    if has_options:
        stream = add_label_options_to_stream(stream, label)
    else:
        stream = add_labels_to_stream(stream, label)

    return {
        "view_id": "choice" if has_options else "classification",  # Annotation interface to use
        "dataset": dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "exclude": exclude,  # List of dataset names to exclude
        "config": {  # Additional config settings, mostly for app UI
            "choice_style": "single" if exclusive else "multiple", # Style of choice interface
            "exclude_by": "input" if has_options else "task", # Hash value used to filter out already seen examples
        },
    }

Am I missing something obvious? Is this just not possible in my version of Prodigy? I'd appreciate any insights on the matter.
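
For reference, the highlighting aimed for here comes down to attaching character-offset spans to each task. The sketch below does this in plain Python, without PatternMatcher. `add_term_spans` and the term list are hypothetical names used only for illustration, and the sketch assumes that the annotation interface highlights entries in a task's "spans" list that carry "start", "end" and "label" keys.

```python
import re
from typing import Dict, Iterable, Iterator, List


def add_term_spans(stream: Iterable[Dict], terms: List[str], label: str) -> Iterator[Dict]:
    """Yield every task with a "spans" list of case-insensitive whole-word matches.

    Hypothetical helper; assumes `terms` is non-empty and each task has a "text" key.
    """
    # Try longer terms first so overlapping alternatives prefer the longer match.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    for task in stream:
        spans = [
            {"start": m.start(), "end": m.end(), "label": label}
            for m in pattern.finditer(task["text"])
        ]
        # Copy the task so the input is not mutated; tasks without matches are kept.
        yield {**task, "spans": spans}


tasks = [
    {"text": "Refund requested, no refund issued."},
    {"text": "Nothing relevant here."},
]
result = list(add_term_spans(tasks, ["refund"], "TERM"))
for task in result:
    print(task["text"], task["spans"])
```

Note that this version yields every task, including those with no matching term, so the stream never comes back empty just because a term is absent.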

Hi! What you're doing here should definitely work, so maybe the problem lies elsewhere. Did you double-check that the examples you're annotating aren't already in the dataset you're saving to? In that case, Prodigy will skip them, so that you're only asked each question once. You can also double-check that your input file has the correct format – you can run Prodigy with the PRODIGY_LOGGING=basic env variable to see if anything was skipped.
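
To rule out the input format independently of Prodigy, a quick standalone check along these lines might help. It is only a sketch using the standard library: `check_jsonl` is a hypothetical helper, not part of Prodigy, and it assumes that each line should be a JSON object with a non-empty "text" field.

```python
import json
from typing import List, Tuple


def check_jsonl(lines: List[str]) -> Tuple[int, List[str]]:
    """Return (number of usable records, list of problem descriptions)."""
    usable = 0
    problems = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {i}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {i}: expected a JSON object")
            continue
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"line {i}: missing or empty 'text' field")
            continue
        usable += 1
    return usable, problems


# Made-up sample lines: one valid, one with a wrong key, one truncated.
sample = [
    '{"text": "This sentence mentions pattern1."}',
    '{"txt": "Wrong key"}',
    '{"text": "Broken line"',
]
usable, problems = check_jsonl(sample)
print(f"{usable} usable record(s)")
for problem in problems:
    print(problem)
```

To check a real file, the lines could be read with `open("data.jsonl", encoding="utf8").read().splitlines()` and passed to the same function.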

data.jsonl (70 Bytes)

Hi Ines,
Thanks for your suggestions! I gave them a try, and I get the impression that something is indeed going on with the data. In particular, I ran the following command:

PRODIGY_LOGGING=basic prodigy textcat.manual_pattern testing_highlight data.jsonl --label LABEL1 -F textcat_pattern.py

where textcat_pattern.py is the script I posted in the original question and data.jsonl is the attached sample data file. The output was:

15:21:23 - APP: Using Hug endpoints (deprecated)
15:21:23 - RECIPE: Loading recipe from file textcat_pattern.py
15:21:23 - RECIPE: Calling recipe 'textcat.manual_pattern'
15:21:24 - MODEL: Adding 1 patterns
15:21:24 - CONTROLLER: Initialising from recipe
15:21:24 - VALIDATE: Creating validator for view ID 'classification'
15:21:24 - DB: Initialising database SQLite
15:21:24 - DB: Connecting to database SQLite
15:21:24 - DB: Creating dataset 'testing_highlight'
Added dataset testing_highlight to database SQLite.
15:21:24 - DB: Loading dataset 'testing_highlight' (0 examples)
15:21:24 - DB: Creating dataset '2022-04-20_15-21-24'
15:21:24 - DatasetFilter: Getting hashes for excluded examples
15:21:24 - DatasetFilter: Excluding 0 tasks from datasets: testing_highlight 
15:21:24 - CONTROLLER: Initialising from recipe
15:21:24 - CORS: initialize wildcard "*" CORS origins

It seems that Prodigy is loading 0 examples, which is most likely the reason for the "No tasks available." message. Below you can find the contents of my prodigy.json file, just to make sure that there is nothing going on infrastructure-wise:

{
  "theme": "basic",
  "custom_theme": {
    "smallText" : 15,
    "cardMaxWidth" : 900
  },
  "global_css": ".prodigy-content {text-align: left; padding:2;}.c01127 {display: grid;grid-template-columns: 1fr 1fr;grid-gap: 0px;}",
  "buttons": ["accept", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8080,
  "host": "",
  "cors": true,
  "db": "sqlite",
  "db_settings": {},
  "validate": true,
  "auto_exclude_current": true,
  "force_stream_order": true,
  "instant_submit": false,
  "feed_overlap": false,
  "auto_count_stream": false,
  "total_examples_target": 0,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "swipe_gestures": { "left": "accept", },
  "split_sents_threshold": false,
  "html_template": false,
  "javascript": null,
  "writing_dir": "labelled_data",
  "show_whitespace": false,
}

Software versions:

Prodigy: v1.8.5
spaCy: v2.1.9

Maybe you can see something I am missing. Thanks again for your suggestions!