From Choice annotations to binary annotations with Teach

ines · January 2, 2019, 4:43pm

The default text classification model (via spaCy) assumes that categories are not mutually exclusive. So if you update the model with a text plus a category, the update is performed only for that label, and all other labels are treated as unknown / missing values. Prodigy uses the same approach for binary NER annotations btw: my slides here show an example of this process.
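To make the "missing values" idea a bit more concrete, here is a minimal sketch in plain Python. It is not spaCy's or Prodigy's actual implementation, and the function name `masked_gradient` and the labels `"A"` and `"B"` are made up for illustration. It only shows the principle: labels without an answer contribute a zero gradient, so the update leaves them alone.

```python
from typing import Dict, Optional


def masked_gradient(
    scores: Dict[str, float], answers: Dict[str, Optional[bool]]
) -> Dict[str, float]:
    """Per-label gradient of a simple (score - target) loss.

    Hypothetical helper: labels that have no answer (missing or None)
    get a gradient of 0.0, so they are not updated.
    """
    gradient = {}
    for label, score in scores.items():
        answer = answers.get(label)
        if answer is None:
            # Unknown / missing value: no update for this label.
            gradient[label] = 0.0
        else:
            target = 1.0 if answer else 0.0
            gradient[label] = score - target
    return gradient


# One binary annotation: label "A" was accepted, nothing is known about "B".
print(masked_gradient({"A": 0.8, "B": 0.3}, {"A": True}))
```

Only the entry for `"A"` is non-zero here, which mirrors what happens when you accept or reject a single label in a binary annotation.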

Yeah, this sounds reasonable. The uncertainty sampling is performed by the prefer_uncertain sorter, which takes a stream of (score, example) tuples and yields examples. Under the hood, it uses an exponential moving average to determine whether to send out an example or not. Instead of prefer_uncertain, you can also use the prefer_high_scores sorter, which has the same API, but prioritises high scores.

So in recipes/textcat.py, you could update the teach recipe like this:

from prodigy.components.sorters import prefer_high_scores

# in the recipe:
stream = prefer_high_scores(model(stream))
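If you want an intuition for what a sorter with this API does, here is a rough sketch of an uncertainty filter driven by an exponential moving average. This is only an illustration and not the code Prodigy ships: the function name `sketch_prefer_uncertain`, the `alpha` parameter and the exact uncertainty formula are all assumptions on my part. The only thing taken from the description above is the shape of the API: (score, example) tuples go in, examples come out.

```python
from typing import Any, Iterable, Iterator, Optional, Tuple


def sketch_prefer_uncertain(
    scored_stream: Iterable[Tuple[float, Any]], alpha: float = 0.1
) -> Iterator[Any]:
    """Yield examples whose uncertainty is at least the running average.

    Hypothetical sketch: uncertainty is 1.0 at a score of 0.5 and 0.0 at
    a score of 0.0 or 1.0. `alpha` controls how quickly the moving
    average follows recent examples.
    """
    average: Optional[float] = None
    for score, example in scored_stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2.0
        # Send out the example if it is at least as uncertain as usual.
        if average is None or uncertainty >= average:
            yield example
        # Update the exponential moving average with the new value.
        if average is None:
            average = uncertainty
        else:
            average = alpha * uncertainty + (1.0 - alpha) * average


scored = [(0.5, "a"), (0.99, "b"), (0.5, "c"), (0.01, "d")]
print(list(sketch_prefer_uncertain(scored)))
```

A high-score variant would follow the same pattern and compare the raw score against the moving average instead of the uncertainty. Because the filter works on a stream, it never needs to see all examples up front.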

Our prodigy-recipes repo also has a simplified version of the textcat.teach recipe with a bunch of comments explaining what's going on. So you might find this useful as well as a starting point to write your own custom version:

github.com/explosion/prodigy-recipes/blob/master/textcat/textcat_teach.py

from typing import List, Optional
import spacy
from spacy.training import Example
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.models.textcat import TextClassifier
from prodigy.models.matcher import PatternMatcher
from prodigy.components.sorters import prefer_uncertain
from prodigy.util import combine_models, split_string


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "textcat.teach",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),

This file has been truncated.
