textcat.teach presents same annotation task if text snippet contains multiple patterns

Hi Ines,

This may be by design, but I’m looking for guidance on the best approach to annotation. I have a collection of patterns (I’m building something similar to your Insult Classifier) and I’m using a Reddit dataset to annotate positive/negative examples. If one text snippet contains several of the patterns, I see that snippet several times, each time with a different one of my pattern terms highlighted (and then one more time, with nothing highlighted). Since I’m seeing the same task over and over, it almost seems like the task is to annotate just the portion of the text with the highlighted word.

Should I annotate each comment in the same way to be consistent? (If so, why the multiple instances of the same task?) Or am I actually supposed to be annotating a smaller portion of the comment – the one containing that word? Note: this is all happening within a single annotation session.


Thank you!

Hi! This is currently expected behaviour because the pattern matcher just yields out every result – but you’re right that it’s not very practical and we probably want to change this and make “one match per example” the default behaviour.

If you’re annotating for text classification, you’re giving feedback on the text plus label. The patterns are mostly a means to an end, so if you do see an example with the correct label, you should accept it.

That said, if you do get a lot of duplicate matches, you could also write a function that keeps track of the original example texts you’ve already seen and only yields out an example once:

def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash identifying the original input, e.g. the text
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
        seen.add(input_hash)

stream = filter_stream(stream)
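
All tasks created from the same original text share the same _input_hash, while the _task_hash also takes the suggested span and label into account – that’s why the filter above keys on the input hash rather than the task hash.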

Ok, thanks for the info and for the sample code to filter what I’ve already seen. Very helpful!


@ines - Thank you for sharing the code example above. I added this filtering to the textcat.teach function right after the stream = get_stream() command. However, when I collect annotations, I am still seeing duplicates when there are multiple patterns associated with the same chunk of text. Am I putting the function you suggested above in the right place?

@recipe('textcat_custom.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label_set'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label=None, api=None,
          patterns=None, loader=None, long_text=False, exclude=None):
    """                         
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    if label is None:
        prints("No label specified", "To use the textcat.teach recipe, you "
               "need to provide at least one category label via the --label "
               "or -l argument.", error=True, exits=1)

    nlp = spacy.load(spacy_model, disable=['ner', 'parser'])
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label, long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    def filter_stream(stream):
        seen = set()
        for eg in stream:
            # Get the hash identifying the original input, e.g. the text
            input_hash = eg["_input_hash"]
            if input_hash not in seen:
                yield eg
            seen.add(input_hash)

    stream = filter_stream(stream)
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(model.nlp, prior_correct=5.,
                                 prior_incorrect=5., label_span=False,
                                 label_task=True)
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the textcat model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.

    stream = split_sentences(nlp, stream)
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

@reb-greazy I think you might be adding the stream filter too early! If I read your code correctly, you’re basically adding it before the pattern matches and model suggestions are included, so the filtering doesn’t have any effect.

What happens if you move it to the very end of the function? Like this:

stream = split_sentences(nlp, stream)
stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream)
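
That way, the filter runs on the final stream, after the pattern matcher and model have added their suggestions, so duplicate questions created by multiple matches on the same text are caught right before they’re sent out to the web app.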

@ines - Thanks for the quick reply! It is working as expected now 🙂

Hi,

I’m finally working on adding this code snippet into the textcat.teach recipe. I appreciate the question from @reb-greazy because it helped me put the filter_stream() function in the right place. However, I’m still seeing duplicate annotations after adding in this snippet. Is something else amiss here that I’m not seeing?

@recipe("textcat.teach", 
        dataset=recipe_args["dataset"], spacy_model=recipe_args["spacy_model"],
        source=recipe_args["source"], label=recipe_args["label_set"], api=recipe_args["api"], 
        loader=recipe_args["loader"], patterns=recipe_args["patterns"], long_text=("Long text", "flag", "L", 
        bool), exclude=recipe_args["exclude"],)
def teach(dataset, spacy_model, source=None, label=None, api=None, patterns=None, loader=None,
          long_text=False, exclude=None):
"""
Collect the best possible training data for a text classification model
with the model in the loop. Based on your annotations, Prodigy will decide
which questions to ask next.
"""
log("RECIPE: Starting recipe textcat.teach", locals())
if label is None:
    prints(
        "No label specified",
        "To use the textcat.teach recipe, you "
        "need to provide at least one category label via the --label "
        "or -l argument.",
        error=True,
        exits=1,
    )

nlp = spacy.load(spacy_model, disable=["ner", "parser"])
log("RECIPE: Creating TextClassifier with model {}".format(spacy_model))
model = TextClassifier(nlp, label, long_text=long_text)
stream = get_stream(source, api, loader, rehash=True, dedup=True, input_key="text")
if patterns is None:
    predict = model
    update = model.update
else:
    matcher = PatternMatcher(
        model.nlp,
        prior_correct=5.0,
        prior_incorrect=5.0,
        label_span=False,
        label_task=True,
        filter_labels=label,
    )
    matcher = matcher.from_disk(patterns)
    log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
    # Combine the textcat model with the PatternMatcher to annotate both
    # match results and predictions, and update both models.
    predict, update = combine_models(model, matcher)
# Rank the stream. Note this is continuous, as model() is a generator.
# As we call model.update(), the ranking of examples changes.

def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Get the hash idenfitying the original input, e.g. the text 
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
        seen.add(input_hash)

stream = prefer_uncertain(predict(stream))
stream = filter_stream(stream)

return {
    "view_id": "classification",
    "dataset": dataset,
    "stream": stream,
    "exclude": exclude,
    "update": update,
    "config": {"lang": nlp.lang, "labels": model.labels},
}

Your logic looks correct. Do the duplicate examples have the same _input_hash? You could try rehashing the examples, just in case:

from prodigy import set_hashes

def filter_stream(stream):
    seen = set()
    for eg in stream:
        # Re-do the hashing, even if the example already has hashes set
        eg = set_hashes(eg, overwrite=True)
        input_hash = eg["_input_hash"]
        if input_hash not in seen:
            yield eg
        seen.add(input_hash)

Thanks for this. We realized the recipe I’m editing may not be the one that’s actually being called, and we can’t find the one that is, so it seems like a problem we have to sort out on our end. I’m guessing your suggested code changes will work once I’m editing the right file. Thanks for your help so far.

Update: I made a custom recipe by copying the textcat.py file exactly and adding in your code changes. The recipe is in the same recipes directory as the original textcat.py file. I had to update the relative import statements from “..” to “prodigy.” because I was getting import errors (“ImportError: attempted relative import with no known parent package”).
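
In case it helps anyone else, the change looked roughly like this (the exact modules vary by recipe, so these two imports are just examples):

# Before: relative imports that only resolve inside the prodigy package
# from ..components.loaders import get_stream
# from ..models.textcat import TextClassifier

# After: absolute imports that work from a standalone recipe file
from prodigy.components.loaders import get_stream
from prodigy.models.textcat import TextClassifier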

I called it successfully with this command:

prodigy filtered-textcat.teach reportable en_core_web_sm [path to dataset] --loader reddit --label REPORTABLE --patterns /tmp/reportable_patterns.jsonl -F [path to recipe]

We suspect that the reason the edits to the original textcat.py file weren’t changing the behavior of textcat.teach had something to do with the shared object files in the “models” and “components” directories. Is it possible there’s a compiled version of textcat.py being used somehow that we can’t access? We could be mistaken… any insights about this?

Glad it worked!

No, that shouldn't be happening 🤔 Recipes are shipped as plain Python files to make them easy to inspect. So if you check the location of your Prodigy installation (print(prodigy.__file__)) and edit the source in recipes/ there, the changes should be reflected in Prodigy. Maybe you have two installations of Prodigy or something?

This makes sense, though – if you load a recipe with -F, it's imported as a separate module. You also don't have to put it in the Prodigy package directory – it can live anywhere on your system. I'd definitely recommend using standalone recipe scripts wherever possible, instead of editing the package source. They're just regular Python files, so you can keep them in a repo under version control, share them with others, etc.
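
For example, a minimal standalone recipe script can look like this – the recipe name and the hard-coded stream here are just placeholders:

# recipe.py – load with: prodigy my-textcat-recipe my_dataset -F recipe.py
import prodigy

@prodigy.recipe("my-textcat-recipe")
def my_textcat_recipe(dataset):
    # Hard-coded stream for illustration – normally you'd use a loader
    stream = [{"text": "This is an example", "label": "REPORTABLE"}]
    return {"view_id": "classification", "dataset": dataset, "stream": stream}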

In case you haven't seen it yet, you might also find our prodigy-recipes repo helpful: https://github.com/explosion/prodigy-recipes

Thanks, as always, for your prompt and detailed responses. I’ll take a look at the prodigy-recipes repo.