Is there a way to highlight seeded terms in textcat.teach?


(Oliver Beavers) #1

Hello, dealing with 1-2 pages of text per classification.

The criteria are simple enough, but hard to spot because the line breaks have been removed. Is there a way to highlight seeded terms in each text item?

(Ines Montani) #2

Yes, but it’s not yet supported out of the box. (The main problem at the moment is that the seed terms logic only returns the tasks, not the actual matches.) However, we’ve been rewriting the textcat.teach recipe to use the PatternMatcher, which makes it consistent with ner.teach. You’ll then be able to set the --patterns argument to a match patterns file, which should also give you a lot more flexibility than plain string matches via seed terms. The matched spans are then highlighted in the same style as named entities.
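As a rough illustration of what such a patterns file could contain, here is a minimal sketch that converts plain seed terms into token-based match patterns, one JSON object per line. The helper name `seeds_to_patterns`, the label `CONTRACT_RISK`, and the example terms are made up for this example, and the whitespace split is a simplification: spaCy's tokenizer may split a term differently (for instance around punctuation), so the generated patterns are worth checking by hand.

```python
import json


def seeds_to_patterns(seed_terms, label):
    """Turn plain seed terms into token-based match patterns (hypothetical helper)."""
    patterns = []
    for term in seed_terms:
        # Naive whitespace split: one token description per word,
        # matched case-insensitively via the "lower" attribute.
        token_specs = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": token_specs})
    return patterns


# Example terms and label are placeholders, not taken from the thread.
patterns = seeds_to_patterns(["late fee", "termination"], "CONTRACT_RISK")

# One JSON object per line (JSONL).
jsonl = "\n".join(json.dumps(p) for p in patterns)
print(jsonl)
```

Saved to a file such as patterns.jsonl, this output is the kind of input the --patterns argument is meant to take.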

Here’s an updated version of textcat.teach that you could try:

# From the @recipe decorator (its other arguments are not shown here):
#         long_text=("Long text", "flag", "L", bool),
def teach(dataset, spacy_model, source=None, label='', api=None, patterns=None,
          loader=None, long_text=False, exclude=None):
    """
    Collect the best possible training data for a text classification model
    with the model in the loop. Based on your annotations, Prodigy will decide
    which questions to ask next.
    """
    log('RECIPE: Starting recipe textcat.teach', locals())
    DB = connect()
    nlp = spacy.load(spacy_model, disable=['ner', 'parser'])
    log('RECIPE: Creating TextClassifier with model {}'
        .format(spacy_model))
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    # Any further get_stream keyword arguments were cut off in the post.
    stream = get_stream(source, api, loader, rehash=True, dedup=True)
    if patterns is None:
        # No patterns file: rank and update with the text classifier alone.
        predict = model
        update = model.update
    else:
        matcher = PatternMatcher(model.nlp, prior_correct=5., prior_incorrect=5.)
        matcher = matcher.from_disk(patterns)
        log("RECIPE: Created PatternMatcher and loaded in patterns", patterns)
        # Combine the textcat model with the PatternMatcher to annotate both
        # match results and predictions, and update both models.
        predict, update = combine_models(model, matcher)
    # Rank the stream. Note this is continuous, as model() is a generator.
    # As we call model.update(), the ranking of examples changes.
    stream = prefer_uncertain(predict(stream))
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'exclude': exclude,
        'update': update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

Also make sure to import the PatternMatcher:

from ..models.matcher import PatternMatcher
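To make the highlighting idea concrete, here is a small standalone sketch that does not use Prodigy at all. It finds seed terms in a text with a regular expression and records them as character-offset spans with `start`, `end` and `label` keys, which is the general shape of span annotations on a task. The function name `add_highlight_spans` and the label `SEED` are hypothetical, and whether a given interface renders spans added this way is an assumption to verify for your version.

```python
import re


def add_highlight_spans(text, terms, label):
    """Return a task dict whose "spans" mark every occurrence of the terms."""
    spans = []
    for term in terms:
        # Word boundaries keep "fee" from matching inside "feed";
        # re.escape protects terms that contain regex metacharacters.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append({"start": match.start(), "end": match.end(), "label": label})
    # Keep spans in reading order.
    spans.sort(key=lambda span: span["start"])
    return {"text": text, "spans": spans}


task = add_highlight_spans(
    "A late fee applies. The Late Fee is 5%.", ["late fee"], "SEED"
)
for span in task["spans"]:
    print(span, task["text"][span["start"]:span["end"]])
```

This only matches literal strings, so it is closer to the old seed terms than to token-based patterns; the PatternMatcher route above remains the more flexible option.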

Here are some related threads that you might find helpful as well:

Textcat.teach not using the pattern file
(Oliver Beavers) #3

Awesome, thanks!

(Ines Montani) #4

Just released v1.4.0, which lets you bootstrap textcat.teach with a patterns file instead of only seed terms (just like ner.teach) :tada:
