Custom textcat for 2nd level

jiebei · December 17, 2022, 8:13pm

Hello! I am working on text classification with a hierarchical schema. Now I am trying to pass the classified text obtained from the top level to the second level with a custom recipe. I am following this post.
The problems that I have now are:

I need to use textcat.manual, with options to select for the 2nd-level classification
No labels from the top level have been passed, and no options for the 2nd level are shown
I am not sure what I have missed in my script. Thank you so much for your thoughts!

hierarchy = {'RESEARCH': ['find a book/article', 'Citations', 'Databases/electronic resources','ILL/EZborrow','Data'], 'NON-RESEARCH': ['Room reserve/Building spaces/directions', 'Printer/Scanner/Technical']}

def get_stream(examples):
    for eg in examples:   # the examples with top-level categories
        top_labels = eg['accept']  # ['A'] or ['B', 'C'] if multiple choice
        for label in top_labels:
            sub_labels = hierarchy[label]
            options = [{'id': opt, 'text': opt} for opt in sub_labels]
            # create new example with text and sub labels as options
            new_eg = {'text': eg['text'], 'options': options}
            yield new_eg
            
import prodigy 
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL
           
@prodigy.recipe(
    "custom_textcat_2ndlevel",
    dataset=("Dataset loader TEXTCAT annotations from", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
)           
def custom_recipe(dataset, source):
    stream = JSONL(source)
    stream = get_stream(stream)
    
    return {
        'view_id': 'classification',       # Annotation interface to use
        'dataset': dataset,     # Name of dataset to save annotations
        'stream': stream,       # Incoming stream of examples
    }

I got text without any labels from the top or second level.
My Prodigy command is:

python -m prodigy custom_textcat_2ndlevel chat_correct-2 .\top_out\chat_correct-1.jsonl -F 2level.py

Jette16 · December 19, 2022, 8:51am

Hello @jiebei,
Thank you for your question!
The problem you describe results from the wrong view_id. The classification interface is a binary interface: Prodigy renders a single label at the top of the task, and the user decides whether that label applies to the text or image shown. To get the options displayed, you need the choice interface instead. The dictionary your recipe returns should therefore look like this:

return {
    'view_id': 'choice',      
    'dataset': dataset,
    'stream': stream,
}

or, if you want to select multiple options for each task, like this:

return {
    'view_id': 'choice',
    'config': {
        'choice_style': 'multiple',
        },
    'dataset': dataset,
    'stream': stream, 
}

Please let me know if this solves your problem and if you have any further questions.
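For the other half of the original question (the top-level label not being visible on the second-level task), here is a minimal sketch of the stream logic. It is plain Python with no Prodigy imports, so it can be tried on its own. It assumes the top-level examples carry "answer" and "accept" keys and that the accepted IDs match the keys of hierarchy. The make_second_level_tasks name and the "top_label" meta key are illustrative, not part of the original recipe.

```python
hierarchy = {
    "RESEARCH": ["find a book/article", "Citations",
                 "Databases/electronic resources", "ILL/EZborrow", "Data"],
    "NON-RESEARCH": ["Room reserve/Building spaces/directions",
                     "Printer/Scanner/Technical"],
}


def make_second_level_tasks(examples, hierarchy):
    """Yield one choice task per accepted top-level label (hypothetical helper)."""
    for eg in examples:
        # Skip examples that were rejected or ignored at the top level.
        if eg.get("answer", "accept") != "accept":
            continue
        for label in eg.get("accept", []):
            sub_labels = hierarchy.get(label)
            if not sub_labels:
                continue  # top-level label without second-level options
            yield {
                "text": eg["text"],
                "options": [{"id": opt, "text": opt} for opt in sub_labels],
                # Assumption: keep the top-level label in "meta" so it stays
                # attached to the task and to the saved annotation.
                "meta": {"top_label": label},
            }


# Made-up examples standing in for the exported top-level annotations.
examples = [
    {"text": "How do I cite a website in APA?", "accept": ["RESEARCH"], "answer": "accept"},
    {"text": "The printer on floor 2 is jammed.", "accept": ["NON-RESEARCH"], "answer": "accept"},
    {"text": "asdf", "accept": [], "answer": "reject"},
]
tasks = list(make_second_level_tasks(examples, hierarchy))
for task in tasks:
    print(task["meta"]["top_label"], [opt["id"] for opt in task["options"]])
```

In the recipe, a generator like this would take the place of get_stream, with 'view_id': 'choice' in the returned dictionary.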

jiebei · December 20, 2022, 3:06pm

Your suggested change works perfectly! Thank you!

I do have a follow-up question. After I complete textcat.manual and want to move on to textcat.correct with the same hierarchical schema, I am not sure how to modify the code to achieve this. Could you share your thoughts on that? Do you happen to have any past Q&A posts that address a similar issue? Thanks again!

ryanwesslen · December 22, 2022, 8:15pm

hi @jiebei!

Have you thought about how/when you will combine your different textcat components into one spaCy pipeline (model)?

This is important because, to use a correct recipe, you'll need an existing model. Was your expectation that you would have two different correct recipes, one for level one and another for level two?

This is likely the easiest approach, and it's the one I would suggest. For this, you would continue annotating and training each level separately until you're done. Then you'll need to assemble the two textcat models into one pipeline.

correct recipes: one for each level

To create a correct recipe for each level, check out the Prodigy recipe repo, specifically the code for the textcat_correct recipe.

github.com

explosion/prodigy-recipes/blob/master/textcat/textcat_correct.py

import copy
from typing import List, Optional
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_string
import spacy
from spacy.tokens import Doc
from spacy.training import Example


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "textcat.correct",
    dataset=("The dataset to use", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    update=("Whether to update the model during annotation", "flag", "UP", bool),

This file has been truncated.

You can add the get_stream function you currently have, which will add the options for the appropriate labels.

The add_suggestions function provides your existing model's predictions. I think you may not need to modify it. You can keep or remove the additional update function, which is for "incremental learning".
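As a rough illustration of how the suggestion step could work for the second level, the sketch below pre-selects the options whose score passes a threshold by filling the task's "accept" list. It is not the code from textcat_correct.py. Here score_text is a stand-in for something like the doc.cats scores of a trained second-level model, and fake_score_text, its scores, and the 0.5 threshold are all assumptions for demonstration.

```python
def add_suggestions(stream, score_text, threshold=0.5):
    """Pre-select options whose score is at least the threshold (sketch)."""
    for eg in stream:
        scores = score_text(eg["text"])
        option_ids = [opt["id"] for opt in eg["options"]]
        task = dict(eg)  # shallow copy so the incoming example is not mutated
        task["accept"] = [
            oid for oid in option_ids if scores.get(oid, 0.0) >= threshold
        ]
        yield task


def fake_score_text(text):
    # Hypothetical stand-in for a trained second-level textcat model.
    return {"Citations": 0.8, "Data": 0.1}


stream = [
    {
        "text": "How do I cite a website?",
        "options": [
            {"id": "Citations", "text": "Citations"},
            {"id": "Data", "text": "Data"},
        ],
    }
]
suggested = list(add_suggestions(stream, fake_score_text))
print(suggested[0]["accept"])
```

Only options that belong to the task are ever pre-selected, so a score for a label outside the current top-level branch is ignored.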

Combining two textcat models into one spaCy pipeline

This is a bit tricky as mentioned here:

The key would be to train (e.g., with prodigy train) two different textcat models independently. This means you have two different models, each in its own folder. Let's call the first textcat_1 and the second textcat_2. Each of those folders will have two sub-folders: model-best and model-last.

When you're happy with these two models' performance and want to combine them, you'll need to follow these instructions to source and assemble the two models using one new config.cfg.

Following those instructions, and starting in the folder that contains the two model folders textcat_1 and textcat_2, save this file as your combined model's config file:

# config.cfg
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat1","textcat2"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat1]
source = "textcat_1/model-best"
component = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat1.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat2]
source = "textcat_2/model-best"
component = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat2.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
ner = null
textcat_multilabel = null
parser = null
tagger = null
senter = null
spancat = null

[corpora.textcat]
@readers = "prodigy.TextCatCorpus.v1"
datasets = ["textcat_1","textcat_2"]
eval_datasets = []
exclusive = true

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Two things to notice:

This is like the default config.cfg output for textcat when using prodigy train. However, it includes two textcat components: textcat1 and textcat2.
The source setting specifies where each original component is loaded from. In this example, I take the model-best from each of the two models. You can change this to model-last if you prefer the most recent model run instead of the best one.

With this new file, you'll then need to run spacy assemble using that config.cfg file:

python -m spacy assemble config.cfg combined_model

This will create a new folder called combined_model. You can then load and run that model as you would a normal spaCy model:

import spacy
nlp = spacy.load("combined_model")  # the folder created by spacy assemble above
doc = nlp("This is a test sentence to score.")
doc.cats
{'LABEL1': 0.16814707219600677, 'LABEL2': 0.7714870572090149, 'LABEL3': 0.06036587804555893, 'LABEL1_A': 0.028656000271439552, 'LABEL1_B': 0.10963789373636246, 'LABEL2_A': 0.6855509877204895,  'LABEL2_B': 0.049830008298158646, 'LABEL3_A': 0.08758322149515152, 'LABEL3_B': 0.0387418232858181}

This would assume that the label hierarchy is:

hierarchy = {'LABEL1': ['LABEL1_A','LABEL1_B'], 'LABEL2': ['LABEL2_A','LABEL2_B'], 'LABEL3': ['LABEL3_A','LABEL3_B']}

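Notice how the model predictions for the top-level labels ['LABEL1','LABEL2','LABEL3'] sum to 1, while the second-level labels separately sum to 1. Each textcat component scores its own set of mutually exclusive labels, independently of the other.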
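Since the two components score independently, the hierarchy has to be applied when reading doc.cats. One possible way, sketched below with the scores from the example above, is to pick the best top-level label first and then choose only among its children. The predict_hierarchical helper is hypothetical, and restricting the second-level choice to the predicted branch is a design assumption, not something spaCy does for you.

```python
cats = {
    "LABEL1": 0.16814707219600677, "LABEL2": 0.7714870572090149,
    "LABEL3": 0.06036587804555893,
    "LABEL1_A": 0.028656000271439552, "LABEL1_B": 0.10963789373636246,
    "LABEL2_A": 0.6855509877204895, "LABEL2_B": 0.049830008298158646,
    "LABEL3_A": 0.08758322149515152, "LABEL3_B": 0.0387418232858181,
}
hierarchy = {
    "LABEL1": ["LABEL1_A", "LABEL1_B"],
    "LABEL2": ["LABEL2_A", "LABEL2_B"],
    "LABEL3": ["LABEL3_A", "LABEL3_B"],
}


def predict_hierarchical(cats, hierarchy):
    """Pick the best top-level label, then the best of its children (sketch)."""
    top = max(hierarchy, key=lambda label: cats[label])
    sub = max(hierarchy[top], key=lambda label: cats[label])
    return top, sub


# Each level's scores come from a separate exclusive-class component,
# so each level should sum to roughly 1 on its own.
top_total = sum(cats[label] for label in hierarchy)
sub_total = sum(cats[s] for subs in hierarchy.values() for s in subs)
print(round(top_total, 3), round(sub_total, 3))

print(predict_hierarchical(cats, hierarchy))  # ('LABEL2', 'LABEL2_A')
```

Without the restriction, the highest-scoring second-level label could in principle belong to a different branch than the predicted top-level label.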

Hope this helps!

jiebei · January 20, 2023, 4:00pm

Sorry for getting back to you late, and thanks for the detailed instructions! A quick follow-up question: I am curious how the hierarchy is realized in this customized config. Should I place my top-level textcat model as textcat1 and my second-level model as textcat2 to achieve that? Thank you!

ryanwesslen · January 23, 2023, 3:07pm

hi @jiebei!

In this case I don't think it matters as they are independent of one another. Order does matter in spaCy pipelines if one component is used as an input in another. Here's a good FAQ in spaCy that explains it.
