Is there a way to highlight seeded terms in textcat.teach?

Hello, I'm dealing with 1-2 pages of text per classification.

The criteria are simple enough, but the relevant terms are hard to spot because the line breaks have been removed. Is there a way to highlight seeded terms in each text item?

Yes, but it's not yet supported out-of-the-box. (The main problem at the moment is that the seed terms logic only returns the tasks, not the actual matches.) However, we've been rewriting textcat.teach to use the PatternMatcher and make it consistent with ner.teach. You'll then be able to set the --patterns argument to a match patterns file, which should also give you a lot more flexibility than plain string matches via seed terms. The matched spans are then highlighted in the same style as named entities.
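For reference, a patterns file is newline-delimited JSON, where each entry's "pattern" is either an exact string or a spaCy token pattern. A small illustrative example (the label and patterns here are made up):

{"label": "POSITIVE", "pattern": "recommend"}
{"label": "POSITIVE", "pattern": [{"lower": "really"}, {"lower": "love"}]}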

Here’s an updated version of textcat.teach that you could try:

@recipe('textcat.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label='', api=None, patterns=None,
          loader=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    DB = connect()
    nlp = spacy.load(spacy_model, disable=['ner', 'parser'])
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(model.nlp, prior_correct=5., prior_incorrect=5.)
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the textcat model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

Also make sure to import the PatternMatcher:

from ..models.matcher import PatternMatcher
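With that in place, you can pass the patterns file on the command line. A hypothetical invocation (the dataset, model, and file names are placeholders):

prodigy textcat.teach my_dataset en_core_web_sm news.jsonl --label POSITIVE --patterns patterns.jsonl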


Awesome, thanks!

Just released v1.4.0, which lets you bootstrap textcat.teach with a patterns file instead of only seed terms (just like ner.teach) :tada:


Hello,

This is an old thread, but I have basically the same question:

I am annotating for target-dependent sentiment classification. Accordingly, I need to annotate a given document (e.g. in textcat) with regard to a specific target.

My plan for this is to highlight the "target" spans that the document-level annotation should condition on.

For example, if I wanted to annotate the following document for sentiment targeting "apples":

I hate apples, but I love pears

I'd want to highlight the span for "apples" (and tag this document as positive, etc).

Using textcat.manual, I can do this easily by providing the spans that I want to highlight. However, when I provide spans to textcat.teach, they are not rendered.
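For context, here's the kind of input that renders highlights in textcat.manual – a task with pre-defined spans (the TARGET label is just a placeholder I chose for illustration):

{"text": "I hate apples, but I love pears", "spans": [{"start": 7, "end": 13, "label": "TARGET"}]}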

Accordingly, I took a look at the textcat_teach.py recipe and found this:

matcher = PatternMatcher(
    nlp,
    prior_correct=5.0,
    prior_incorrect=5.0,
    label_span=False,
    label_task=True,
)

If I set label_span=True in a custom recipe, I can get pattern matches to be highlighted. Unfortunately, this also forces the task label to match the pattern label.

So, for example, I might have a pattern file like:

{"label":"APPLE","pattern":[{"lower":"apple"}]}

Using label_span=True, "apple" would be highlighted in my example doc. However, even if I set --label POSITIVE, the task label shows up as 'APPLE'.

Is there an easy way to fix this?

Also, I realize there have been several other questions related to highlighting spans with textcat.teach. However, they are all over a year old and it seems like the "best" way to accomplish this might have changed during updates.

Thanks!

EDIT: I've also noticed that using textcat.teach with label_span=True seems to serve documents with a pattern match twice: first with the span highlighted, then again without it. Running through some toy examples, it seems like annotations for the highlighted version store the span for the pattern match, while annotations for the non-highlighted version store no span.

This all seems a bit weird, since only a single version of the document exists in the input data.

@JoeEHoover Hi! In case you haven't seen it yet, you can find the documentation of the PatternMatcher and its settings here: Components and Functions · Prodigy · An annotation tool for AI, Machine Learning & NLP

The "label" in the pattern refers to the label the pattern refers to – so this would always be APPLE. Prodigy v1.9 adds a filter_labels setting to the PatternMatcher that takes the list of labels assigned on the command line. So --label POSITIVE would mean that the label you're annotating is POSITIVE, and you'd only see pattern matches for "label": "POSITIVE". This lets you reuse pattern files across recipes without seeing irrelevant matches.

The duplicate questions could happen in previous versions of Prodigy if both the model and the pattern matcher produced a suggestion for the same text. Because the task with the highlighted span received a different task hash (due to the span), it wasn't considered a duplicate. This is now resolved by the task_hash_keys setting, which lets you define the task keys used to generate the task hash. For instance, setting it to ["text"] means that only the text value is considered. combine_matches=True ensures that you only ever see one suggestion per label, even if the same text contains multiple matches. Both of these settings are enabled by default for textcat.teach.
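A minimal sketch of those two settings together (keyword names as described above, again assuming a v1.9+ PatternMatcher):

matcher = PatternMatcher(
    nlp,
    combine_matches=True,     # at most one suggestion per label for a given text
    task_hash_keys=["text"],  # hash on the text only, so the versions with and
                              # without highlighted spans dedupe to one question
).from_disk("patterns.jsonl")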