Is there a way to highlight seeded terms in textcat.teach?

ines · February 15, 2018, 8:50am

Yes, but it’s not yet supported out of the box. (The main problem at the moment is that the seed terms logic only returns the tasks, not the actual matches.) However, we’ve been rewriting textcat.teach to use the PatternMatcher and make it consistent with ner.teach. You’ll then be able to pass a match patterns file via the --patterns argument, which should also give you a lot more flexibility than plain string matches via seed terms. The matched spans are then highlighted in the same style as named entities.
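If you already have a list of seed terms, converting them into a patterns file could look roughly like the sketch below. It assumes the same JSONL patterns format that ner.teach uses (one object per line with a "label" and a token-based "pattern"). The label INSULT, the seed terms and the output file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical label and seed terms, for illustration only.
LABEL = "INSULT"
SEED_TERMS = ["idiot", "loser", "shut up"]


def seed_to_pattern(term, label):
    """Turn one seed term into a token-based match pattern.

    Each whitespace-separated word becomes one token that is matched on
    its lowercase form, so "Shut up" and "shut up" give the same pattern.
    """
    tokens = [{"lower": word.lower()} for word in term.split()]
    return {"label": label, "pattern": tokens}


def write_patterns(terms, label, path):
    """Write one JSON object per line (JSONL) to the given path."""
    with Path(path).open("w", encoding="utf8") as f:
        for term in terms:
            f.write(json.dumps(seed_to_pattern(term, label)) + "\n")


if __name__ == "__main__":
    # Hypothetical output location; any writable path works.
    out_path = Path(tempfile.gettempdir()) / "insult_patterns.jsonl"
    write_patterns(SEED_TERMS, LABEL, out_path)
    print(out_path.read_text(encoding="utf8"))
```

The resulting file is what you would then point --patterns at. The label in the patterns presumably needs to match the label you pass to the recipe, so treat that as something to double-check.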

Here’s an updated version of textcat.teach that you could try:

@recipe('textcat.teach',
        dataset=recipe_args['dataset'],
        spacy_model=recipe_args['spacy_model'],
        source=recipe_args['source'],
        label=recipe_args['label'],
        api=recipe_args['api'],
        loader=recipe_args['loader'],
        patterns=recipe_args['patterns'],
        long_text=("Long text", "flag", "L", bool),
        exclude=recipe_args['exclude'])
def teach(dataset, spacy_model, source=None, label='', api=None, patterns=None,
          loader=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    DB = connect()
    nlp = spacy.load(spacy_model, disable=['ner', 'parser'])
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    stream = get_stream(source, api, loader, rehash=True, dedup=True,
                        input_key='text')
    if patterns is None:
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(model.nlp, prior_correct=5., prior_incorrect=5.)
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the textcat model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

Also make sure to import the PatternMatcher:

from ..models.matcher import PatternMatcher
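In case the comment about ranking the stream is unclear: predict yields (score, example) tuples, and prefer_uncertain consumes them lazily and decides which examples to send out. The toy version below is not Prodigy's actual implementation, just a simplified stand-in that, within each small batch, puts the examples with scores closest to 0.5 first. The function name and the batch_size parameter are invented for this sketch.

```python
def prefer_uncertain_sketch(scored_stream, batch_size=4):
    """Toy stand-in for uncertainty ranking (not Prodigy's real logic).

    Consumes (score, example) tuples lazily and, for each batch, yields
    the examples ordered by how close their score is to 0.5.
    """
    def flush(batch):
        # Most uncertain first: smallest distance from 0.5.
        for _, example in sorted(batch, key=lambda pair: abs(pair[0] - 0.5)):
            yield example

    batch = []
    for score, example in scored_stream:
        batch.append((score, example))
        if len(batch) == batch_size:
            yield from flush(batch)
            batch = []
    # Yield whatever is left in the final, possibly smaller batch.
    yield from flush(batch)


if __name__ == "__main__":
    # Made-up scores and texts, for illustration only.
    scored = [
        (0.95, {"text": "clearly positive"}),
        (0.50, {"text": "could go either way"}),
        (0.10, {"text": "clearly negative"}),
        (0.45, {"text": "borderline"}),
    ]
    for example in prefer_uncertain_sketch(iter(scored)):
        print(example["text"])
```

Because everything is a generator, the scores are only computed when the next batch is requested, which is why updating the model in between changes which examples come up next.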
