hi @jiebei!
Have you thought about how/when you will combine your different `textcat` components into one spaCy pipeline (model)? This is important because to use a correct recipe, you'll need an existing model. Was your expectation that you would have two different correct recipes -- one for level one and another for level two?

This is likely the easiest approach and I would suggest doing it. For this, you would continue annotating/training each level separately until you're done. Then you'll need to assemble the two `textcat` models into one pipeline.
Correct recipes: one for each level
To create a correct recipe for each level, check out the Prodigy recipe repo, specifically the code for the `textcat_correct` recipe. You can add the `get_stream` function you currently have, which will add the options for the appropriate labels. The `add_suggestions` function provides your existing models' predictions; I think you may not need to modify this. You can keep or remove the additional `update` function, which is for "incremental learning".
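For illustration only, here's a minimal sketch of what such a customized recipe for one level might look like. The recipe name, the JSONL source, and the 0.5 score threshold are all assumptions; the actual `textcat_correct` code in the recipe repo is the authoritative starting point.

```python
import prodigy
from prodigy.components.loaders import JSONL
import spacy

@prodigy.recipe(
    "textcat.correct.level1",  # hypothetical recipe name
    dataset=("Dataset to save answers to", "positional", None, str),
    spacy_model=("Loadable spaCy model with a textcat component", "positional", None, str),
    source=("Path to the source data (JSONL)", "positional", None, str),
)
def textcat_correct_level1(dataset, spacy_model, source):
    nlp = spacy.load(spacy_model)
    labels = nlp.pipe_labels["textcat"]
    stream = JSONL(source)

    def add_suggestions(stream):
        # Attach the model's predictions: render this level's labels as
        # options and pre-accept those above an (assumed) 0.5 threshold
        for eg in stream:
            doc = nlp(eg["text"])
            eg["options"] = [{"id": label, "text": label} for label in labels]
            eg["accept"] = [label for label, score in doc.cats.items() if score >= 0.5]
            yield eg

    return {
        "dataset": dataset,
        "stream": add_suggestions(stream),
        "view_id": "choice",
        "config": {"choice_style": "single"},  # one label per level (exclusive classes)
    }
```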
Combining two textcat models into one spaCy pipeline
This is a bit tricky, as mentioned here:

The key would be to train (e.g., `prodigy train`) two different `textcat` models independently. This would mean you have two different models, each with its own folder. Let's call one `textcat_1` and the second `textcat_2`. Each of those folders will have two sub-folders: `model-best` and `model-last`.
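As a rough sketch (the dataset names `level1_data` and `level2_data` are hypothetical placeholders for your own Prodigy datasets), the two independent training runs could look like:

```bash
# Train each level's exclusive textcat model into its own output folder
python -m prodigy train textcat_1 --textcat level1_data
python -m prodigy train textcat_2 --textcat level2_data
```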
When you're happy with these two models' performance and want to combine them, you'll need to follow these instructions to `source` and `assemble` the two models using one new `config.cfg`.

Following those instructions and starting in the same folder where you have the two model folders, `textcat_1` and `textcat_2`, save this file as your combined model's config file:
```ini
# config.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat1","textcat2"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat1]
source = "textcat_1/model-best"
component = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat1.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat2]
source = "textcat_2/model-best"
component = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v1"}
threshold = 0.5

[components.textcat2.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
ner = null
textcat_multilabel = null
parser = null
tagger = null
senter = null
spancat = null

[corpora.textcat]
@readers = "prodigy.TextCatCorpus.v1"
datasets = ["textcat_1","textcat_2"]
eval_datasets = []
exclusive = true

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
```
Two things to notice:

- This is like the default `config.cfg` output for `textcat` when using `prodigy train`. However, it includes two textcat components: `textcat1` and `textcat2`.
- The `source` setting specifies where each original component is sourced from. In this example, I take the `model-best` from each of the two components. You can change this to `model-last` if you prefer the most recent model run instead of the best one.
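Before assembling, you can optionally run spaCy's `debug config` command to validate the new file (purely a sanity check; it isn't required for the next step):

```bash
python -m spacy debug config config.cfg
```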
With this new file, you'll then need to run `spacy assemble` using that `config.cfg` file:
```bash
python -m spacy assemble config.cfg combined_model
```
This will create a new folder called `combined_model`. You can then load and run that model as you would a normal spaCy model:
```python
import spacy

# Load the assembled pipeline containing both textcat components
nlp = spacy.load("combined_model")
doc = nlp("This is a test sentence to score.")
print(doc.cats)
# {'LABEL1': 0.16814707219600677, 'LABEL2': 0.7714870572090149,
#  'LABEL3': 0.06036587804555893, 'LABEL1_A': 0.028656000271439552,
#  'LABEL1_B': 0.10963789373636246, 'LABEL2_A': 0.6855509877204895,
#  'LABEL2_B': 0.049830008298158646, 'LABEL3_A': 0.08758322149515152,
#  'LABEL3_B': 0.0387418232858181}
```
This would assume that the label hierarchy is:

```python
hierarchy = {
    'LABEL1': ['LABEL1_A', 'LABEL1_B'],
    'LABEL2': ['LABEL2_A', 'LABEL2_B'],
    'LABEL3': ['LABEL3_A', 'LABEL3_B'],
}
```

Notice how the predictions for the top-level labels `['LABEL1', 'LABEL2', 'LABEL3']` sum to 1, while the predictions for the second-level labels also sum to 1, since each component uses exclusive classes over its own label set.
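If you need a single predicted label per level, a minimal sketch (reusing the `hierarchy` dict and `doc` from above) could pick the highest-scoring label at each level:

```python
# Best top-level label, then best second-level label within it
top = max(hierarchy, key=doc.cats.get)
sub = max(hierarchy[top], key=doc.cats.get)
print(top, sub)  # e.g., LABEL2 LABEL2_A
```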
Hope this helps!