Can we bring back --seeds for textcat.teach?

I understand that --patterns is supposed to be the more powerful version of --seeds. But my experience in trying to use it, along with the experience of others documented here, suggests that it doesn't work as well in some cases. We know that --seeds worked pretty well: it was the basis of this tutorial, where Ines shows she can get great results in only an hour or so using --seeds. When I try to emulate that tutorial using --patterns, by exporting the terms dataset to a patterns file, I run into the issues others have described: the terms don't show up often enough, leading to unbalanced, mostly negative labels, and eventually the model converges towards zero.

There are work-arounds, but one simple possibility would be to bring back --seeds alongside the newer --patterns functionality and let users choose the one that works better for their task.

I wasn't around when this change was introduced (I believe it was v1.4.0) but I think it is true that pattern files are more flexible. To quote the example from the docs:

{"pattern": [{"lemma": "acquire"}, {"pos": "PROPN"}], "label": "COMPANY_SALE"}
{"pattern": "acquisition", "label": "COMPANY_SALE"}

If we only had seed terms, we could only match on exact strings, but pattern files can also leverage lemmas, part-of-speech tags, and other useful features from the token-based matcher in spaCy.
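
For illustration, here's roughly how a token-based pattern like the one above behaves with spaCy's Matcher directly (a minimal sketch using spaCy v3; the lowercase keys in Prodigy's pattern files map onto the same token attributes):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # any pipeline with a tagger and lemmatizer
matcher = Matcher(nlp.vocab)
# Same idea as the COMPANY_SALE pattern above: a form of "acquire" followed by a proper noun
matcher.add("COMPANY_SALE", [[{"LEMMA": "acquire"}, {"POS": "PROPN"}]])

doc = nlp("Hooli acquired Pied Piper last week.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end])  # e.g. COMPANY_SALE acquired Pied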

I'd like to understand your problem a bit better so that I may give advice. What are you trying to predict? Is there anything special about your patterns file? What behavior are you seeing and what did you expect?

@koaning thank you for being on top of my (many) questions.

I think the best way to illustrate the difficulty is to try to follow along with Ines' tutorial here, but with a patterns file instead of seeds. When you get to the part where she passes seeds to her classification model, you have to export your terms to a patterns file and pass that instead. What's interesting is that when I did this, I didn't get the same relevant labelling examples she got when she passed seeds. I only got a few insult examples in hundreds of non-insults.
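
For reference, the export step I mean is roughly this (dataset and file names are placeholders, and the exact arguments may vary between Prodigy versions):

python -m prodigy terms.to-patterns insults_terms ./insult_patterns.jsonl --label INSULT
python -m prodigy textcat.teach insults_data en_core_web_sm reddit_comments.jsonl --label INSULT --patterns ./insult_patterns.jsonl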

Maybe it's the seeds -> pattern transition, or maybe it's something else I'm doing wrong. A lot of the API has changed since she made that video. Maybe an easier ask, rather than bringing back --seeds, is this: can we get a binary textcat tutorial video or document that achieves a great model like Ines', using the current versions of Prodigy and spaCy? I could follow along from there and figure out what I'm doing wrong.

hi @claycwardell!

Thanks for your comment and your feedback! We greatly appreciate users' ideas and I've written an internal note for our engineering team. They'll take this into consideration as I think we may be rethinking the design of some of the recipes too.

If you're not aware, you can look at the Python source of the built-in recipes locally, so you could even experiment with them. To do this, run python -m prodigy stats and find your Location:. From there, look for the recipes folder.
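
For example, assuming a standard install where the built-in recipes ship inside the package, something like this should print that folder:

# Hypothetical snippet: locate the bundled recipes folder of an installed prodigy package
import pathlib
import prodigy

print(pathlib.Path(prodigy.__file__).parent / "recipes")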

Thanks for this feedback too! I'll also forward this to our community team so we can consider this in the new year.

@claycwardell As Ryan said, thanks for the feedback on this :heart:. I agree that we should get updates to the videos and tutorials, we're working on that.

While I don't want to rule out that it's a simpler problem, here's a bit of context around why matching everything up with earlier versions isn't always easy.

Machine learning techniques have continued to develop over the time we've been developing Prodigy, and we've obviously wanted to keep things moving forward. However, sometimes a change that's better overall is worse for a particular dataset, or even for a whole workflow.

I'll give you a quick sketch of the main development trend that's mattered here, going back to way deeper history than is really necessary :slightly_smiling_face:.

Early versions of spaCy (pre v2.0) used non-neural-network statistical models, which relied on boolean indicator features weighted by linear models. The linear models actually performed fairly well for English, and were very fast. Downsides included high memory usage, poor cross-domain transfer, and the inability to use transfer learning.

Transfer learning is a big advantage in Prodigy's context, because it greatly reduces the total number of annotations needed to get to some level of accuracy. The simplest type of transfer learning is pretrained word vectors. Over the last couple of years, transfer learning of contextual word representations has also proven to work extremely well.

However, a neural network with transfer learning behaves very differently from an indicator-based linear model at the start of training. Neural networks start from a random initialization, and it takes a few batches of updates to move them towards something useful. Optimizers also work best if you let them take large steps at the beginning of training. There's certainly a big literature on online learning, where the cost of every update matters, but the architectures and optimizers there are different, so it's difficult to reuse that work.

Beta versions of Prodigy were developed against spaCy v1.0, but we've been using the neural network models since the earliest releases. To make the textcat.teach recipe work, the trick that I developed was to use an ensemble. The ensemble combines predictions from a linear model and a neural network. The idea is that the linear model learns quickly from its sparse features. It starts off knowing nothing, but once it has seen a (word, category) pair in an example, it will reliably label new examples containing that word with that category. So you get nice responsiveness at the beginning. Over time, the neural network model then takes over.
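
To make the ensemble idea concrete, here's a toy sketch (an illustration only, not Prodigy's actual implementation) of combining the two scores with a weight that shifts towards the neural model as training progresses:

# Toy illustration of the ensemble idea -- not Prodigy's actual code.
def ensemble_score(linear_score: float, neural_score: float,
                   n_updates: int, ramp: int = 500) -> float:
    """Weighted average that trusts the quick-to-learn linear model early on
    and gradually hands over to the neural model after roughly `ramp` updates."""
    neural_weight = min(1.0, n_updates / ramp)
    return (1.0 - neural_weight) * linear_score + neural_weight * neural_score

# Early on, a confident linear-model score (e.g. from a seed-term feature)
# dominates the combined score:
print(ensemble_score(linear_score=0.9, neural_score=0.5, n_updates=10))    # ~0.89
# Later, the neural model's score takes over:
print(ensemble_score(linear_score=0.9, neural_score=0.5, n_updates=1000))  # 0.5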

The problem is that it's difficult to make sure these dynamics play out as intended, robustly across a range of datasets, as spaCy versions develop and model architectures continue to change. It's a difficult thing to unit test.

Again, I don't want to rule out a simpler problem, where it's something about labels not lining up, or an outright bug in the matching code, etc. But it's also possible that the machine learning dynamics are simply working a bit worse than they used to, or that they only work on some problems but not on others.

For Prodigy v2, we want to treat recipes like textcat.teach that need specific models and experimentation differently from how we've been doing them. We should break these recipes out and define their machine learning architectures together with the recipe, ideally also with project files that do the necessary experimentation. This will make things much more reproducible, and let users customize things on a per-use-case basis much more easily.


Cool, thanks for the explanation. My main takeaway is: spaCy and Prodigy have become more powerful, generalizable, and efficient overall since the early days of Prodigy, but that doesn't mean they work better for every single use-case. And reading between the lines a bit, it seems that one of the use-cases where the new versions of spaCy and Prodigy don't work as well as they used to is the subject of Ines' original textcat tutorial video, where you train a binary classifier on a relatively small amount of data using active learning.

Is that the correct interpretation here? If so, I'd have two main reactions:

  1. That's cool. That's how software works sometimes.
  2. A little more transparency around this fact, especially with regards to the docs and that original tutorial video, would be good.

We actually need to investigate more to make sure that this is the case for that specific tutorial. If that's the explanation, and it's not some sort of easily fixed problem, then yes we'll indeed update the docs and tutorial. I also agree that we should've looked more carefully at this sooner. Again, thanks for flagging it.


Finally able to resolve this. High level summary:

  • It's in fact not the learning dynamics, fundamentally. The learning dynamics are indeed different, but they were masking a different bug.
  • The bug is fundamentally in the way the pattern matcher and model predictions were combined. This code just didn't do what it was supposed to do, and the result was that the patterns had little impact on the enqueued data.
  • A side issue we encountered is that depending on the pipeline components you have running, following something closer to the original logic can cause performance problems.

The old recipe

I'm mostly pasting this for illustration -- it's missing imports etc., so it won't run verbatim; if you do want to run it, I can definitely provide a runnable form. But see below for a different suggestion.

def find_with_terms(stream, terms, at_least=0, at_most=100,
                    give_up_after=100000):
    """
    Yield examples that have at least one substring matching the given
    terms. You can specify a maximum number of matches, and/or a maximum number
    of examples to search through. You can also specify a minimum number of
    examples. If fewer are found, an error is raised.
    """
    log("SORTER: Looking for at least {} examples containing {} seed terms"
        .format(at_least, len(terms)))
    n_matches = 0
    for i, example in enumerate(stream):
        words = set(example['text'].split())
        if any(term in words for term in terms):
            n_matches += 1
            yield example
        if n_matches >= at_most:
            log("SORTER: Stop after finding maximum of {} examples from seeds"
                .format(at_most))
            break
        if i >= give_up_after:
            log("SORTER: Give up finding seed terms after {} examples"
                .format(give_up_after))
            break
    if n_matches < at_least:
        msg = ("Tried to find at least %d examples containing the %d seed "
               "terms provided, but only found %d matches. Gave up after "
               "searching %d examples from the stream.")
        raise ValueError(msg % (at_least, len(terms), n_matches,
                                give_up_after))


@recipe('textcat.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        seeds=recipe_args['seeds'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label='', api=None,
          loader=None, seeds=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    nlp = spacy.load(spacy_model)
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    if seeds is not None:
        if isinstance(seeds, str) and seeds in DB:
            seeds = get_seeds_from_set(seeds, DB.get_dataset(seeds))
        else:
            seeds = get_seeds(seeds)
        # Find 'seedy' examples
        examples_with_seeds = list(find_with_terms(stream, seeds,
                                   at_least=10, at_most=1000,
                                   give_up_after=10000))
        for eg in examples_with_seeds:
            eg.setdefault('meta', {})
            eg['meta']['via_seed'] = True
        print("Found {} examples with seeds".format(len(examples_with_seeds)))
        examples_with_seeds = [task for _, task in model(examples_with_seeds)]
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(model(stream))
    # Prepend 'seedy' examples, if present
    if seeds:
        log("RECIPE: Prepending examples with seeds to the stream")
        stream = cytoolz.concat((examples_with_seeds, stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': model.update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

Analysis

What we're doing here is searching for a set of examples that have a high density of our seed terms. We then prepend this set onto the stream:

# Find 'seedy' examples
examples_with_seeds = list(find_with_terms(stream, seeds,
                                           at_least=10, at_most=1000,
                                           give_up_after=10000))
...
stream = cytoolz.concat((examples_with_seeds, stream))

It very much bothers me that this recipe has had this bug for so long, so to me it's worth a broader look at what's gone wrong. I think fundamentally, there's no reason for all these steps to happen in one function. We should break this down into different steps: have a recipe to filter the data to find good candidates for annotation, and then enqueue that data in a separate recipe.

Naturally it's already possible to work this way with Prodigy, and I think that's what a lot of users have been doing (which might also be why this bug wasn't more prominent). So going forward, the multi-step approach is what we'll be recommending. We'll also be recording new tutorials with the v2 release we're working on.

Separate filtering recipe

@koaning has written a recipe to filter examples by pattern. You could also easily modify this to work with substring matches, or use some other heuristic. We expect to publish this with v1.12, but here it is (along with Vincent's description) so you can try it out already if you want.

Here's a demo recipe, written in a file called filter-recipe.py, that can filter candidates by reusing parts of the previous recipe:

from typing import Iterable, List, Optional, Union

import toolz
import srsly
from prodigy.components.loaders import get_stream
from prodigy.core import recipe
from prodigy.models.matcher import PatternMatcher
from prodigy.types import RecipeSettingsType, StreamType
from prodigy.util import get_labels, load_model, log, msg, split_string
from spacy.language import Language


def find_matches(
    matcher: PatternMatcher,
    stream: StreamType,
    at_least: int = 10,
    at_most: int = 1000,
    give_up_after: int = 10000,
):
    """
    Yield examples that have at least one matched pattern.
    You can specify a maximum number of matches, and/or a maximum number
    of examples to search through. You can also specify a minimum number of
    examples. If fewer are found, an error is raised.
    """
    log(f"SORTER: Looking for at least {at_least} examples containing patterns")
    n_matches = 0
    for _, example in matcher(toolz.take(give_up_after, stream)):
        # PatternMatcher.__call__ only yields examples that matched the pattern
        assert example["spans"]
        n_matches += 1
        yield example
        if n_matches >= at_most:
            log(
                f"SORTER: Stopped after finding maximum of {at_most} examples from patterns"
            )
            break
    else:
        log(f"SORTER: Gave up finding pattern matches after {give_up_after} examples")
    if n_matches < at_least:
        warning_msg = (
            f"This recipe is using patterns to find candidates to annotate first. It "
            f"tried to find at least {at_least} examples containing one of the patterns "
            f"provided, but only found {n_matches} matches. It gave up searching "
            f"searching {give_up_after} examples from the stream and the recipe will now "
            f"continue by considering candidates that did not get matched."
        )
        msg.warn(warning_msg)


@recipe(
    "custom.filter",
    # fmt: off
    source=("Data to filter (file path or '-' to read from standard input)", "positional", None, str),
    output=("Path to .jsonl file to write subset into", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline or blank:lang (e.g. blank:en)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
    patterns_at_least=("If patterns are provided, the minimum number of matches", "option", "al", int),
    patterns_at_most=("If patterns are provided, the maximum number of matches", "option", "am", int),
    patterns_give_up_after=("If patterns are provided, when to stop searching", "option", "gu", int)
    # fmt: on
)
def filter_feed(
    source: Union[str, Iterable[dict]],
    output: str,
    spacy_model: Union[str, Language],
    label: Optional[List[str]] = None,
    patterns: Optional[str] = None,
    loader: Optional[str] = None,
    patterns_at_least: int = 10,
    patterns_at_most: int = 100,
    patterns_give_up_after: int = 2000,
) -> RecipeSettingsType:
    """
    Filter the dataset based on a patterns.jsonl file
    """
    log("RECIPE: Starting recipe custom.filter", locals())
    if label is None:
        msg.fail("custom.filter requires at least one --label", exits=1)
    if patterns is None:
        msg.fail("custom.filter requires --patterns", exits=1)
    
    nlp = load_model(spacy_model)
    stream = get_stream(
        source, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    matcher = PatternMatcher(
        nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
        combine_matches=True,
        task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
    matched_stream = find_matches(
        matcher,
        stream,
        at_least=patterns_at_least,
        at_most=patterns_at_most,
        give_up_after=patterns_give_up_after,
    )

    srsly.write_jsonl(output, matched_stream)

This recipe has the following --help.

usage: prodigy custom.filter [-h] [-l None] [-pt None] [-lo None] [-al 10] [-am 100] [-gu 2000] source output spacy_model

    Filter the dataset based on a patterns.jsonl file
    

positional arguments:
  source                Data to filter (file path or '-' to read from standard input)
  output                Path to .jsonl file to write subset into
  spacy_model           Loadable spaCy pipeline or blank:lang (e.g. blank:en)

optional arguments:
  -h, --help            show this help message and exit
  -l None, --label None
                        Comma-separated label(s) to annotate or text file with one label per line
  -pt None, --patterns None
                        Path to match patterns file
  -lo None, --loader None
                        Loader (guessed from file extension if not set)
  -al 10, --patterns-at-least 10
                        If patterns are provided, the minimum number of matches
  -am 100, --patterns-at-most 100
                        If patterns are provided, the maximum number of matches
  -gu 2000, --patterns-give-up-after 2000
                        If patterns are provided, when to stop searching

And can be called via something like:

python -m prodigy custom.filter examples.jsonl interesting-subset.jsonl en_core_web_md --patterns patterns.jsonl --label insult -F filter-recipe.py
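
The resulting interesting-subset.jsonl can then be annotated with whatever recipe fits the second step. For a single binary label, for example, you could do something like this (the dataset name here is just a placeholder):

python -m prodigy textcat.manual insults_subset interesting-subset.jsonl --label insult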

I personally prefer the two-step approach for a few reasons.

  • It's more explicit. Instead of having one recipe that tries to apply two techniques, we're able to split that up into two annotation tasks that can each contribute useful training data.
  • This filtering technique is useful for multiple recipes, not just textcat. You could use interesting-subset.jsonl for whatever use-case you like.
  • The "patterns tactic" seems very valuable early on in the active learning process, but in later phases the uncertain examples from the textcat model deserve more attention. These are the examples the model finds confusing, so at some point the model can probably learn the most from them.
  • This filtering approach might invite users to iterate on their patterns files, which seems worth encouraging.

I hope that by sharing these recipes we help solve your current issues, and also help explain why the current recipe is the way it is. I also hope this reply might inspire some custom ideas around annotation techniques. The cool thing about Prodigy is that it's customisable from Python, and you're free to implement any filtering/active learning mechanism that might work well for your problem.

If there are follow up questions/comments: do let us know :slight_smile:
