I need to include two (or more) separate classifiers in my model. For simplicity, imagine I'm classifying text into these categories:
A. Is this document in English? [Binary, mutually exclusive] Classes: Yes, No
B. What animal is this document about? [Multi-label, non-mutually exclusive] Classes: Dogs, Cats, Squirrels
I assume I need to chain together two training sessions, something like this:
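Something roughly like this is what I have in mind (just a sketch against the spaCy v2 API; the base model, paths and training loops are placeholders):

import spacy

# session 1: train the English-or-not classifier on top of a base model
nlp = spacy.load("en_core_web_lg")  # placeholder base model
textcat_en = nlp.create_pipe("textcat")
textcat_en.add_label("English")
nlp.add_pipe(textcat_en, last=True)
with nlp.disable_pipes(*[p for p in nlp.pipe_names if p != "textcat"]):
    optimizer = nlp.begin_training()
    # ... minibatch / nlp.update loop over the English data ...
nlp.to_disk("model_english")

# session 2: load that output and train the animal classifier on top of it
nlp = spacy.load("model_english")
textcat_animals = nlp.create_pipe("textcat")
for label in ("Dog", "Cat", "Squirrel"):
    textcat_animals.add_label(label)
nlp.add_pipe(textcat_animals, name="textcat_animals")
with nlp.disable_pipes(*[p for p in nlp.pipe_names if p != "textcat_animals"]):
    optimizer = nlp.begin_training()
    # ... minibatch / nlp.update loop over the animal data ...
nlp.to_disk("model_english_animals")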
Is that the right idea? Also, how will I access my predictions afterwards? After all, I think spaCy's Doc.cats is a single-level dictionary. Will it include all the labels in one place, e.g. {'Dog': 0.1, 'Cat': 0.9, 'Squirrel': 0.0, 'English': 0.98}?
Yes, you'll probably want to do this as separate training sessions. At runtime there are a few approaches. If you're working with batched data, you might find it easiest to run the first model and save out the texts that are predicted to be in the class you're interested in; then you can run the second model on just that subset. This keeps things simple, because you can work on the two problems separately.
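For example, roughly like this (the model paths and the 0.5 cutoff are just placeholders):

import spacy

nlp_english = spacy.load("model_english")  # first classifier
nlp_animals = spacy.load("model_animals")  # second classifier

texts = ["the dog chased the squirrel", "le chat dort sur le canapé"]
# keep only the texts the first model thinks are English
english_texts = [doc.text for doc in nlp_english.pipe(texts)
                 if doc.cats.get("English", 0.0) >= 0.5]
# run the second model on just that subset
for doc in nlp_animals.pipe(english_texts):
    print(doc.text, doc.cats)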
You can also have a single spaCy pipeline that includes both text classifiers, but the built-in support is a bit rougher. For instance, you can't easily train the pipeline in one step with spacy train: you have to train the components separately and then put them together. The situation with the cats variable is similar. You could save out the previous categories into another variable, or just name the cats something like en_Dog to reflect the combination of categories.
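To "save out the previous categories into another variable", one rough option is a small custom component with a Doc extension (the attribute and model names here are made up):

import spacy
from spacy.tokens import Doc

Doc.set_extension("english_cats", default=None)

def stash_english_cats(doc):
    # copy the first classifier's scores before the second classifier runs
    doc._.english_cats = dict(doc.cats)
    return doc

nlp = spacy.load("model_english")  # pipeline that already has the first textcat
nlp.add_pipe(stash_english_cats, after="textcat")
# ... then add the second text classifier after this component ...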
Finally, another option worth mentioning is doing both problems jointly. I think the pipelined approach probably makes more sense, but the joint approach is worth at least thinking about, because it makes a couple of these operational questions simpler.
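For what it's worth, the joint version would just be a single non-exclusive classifier over all of the labels, something like this (untrained sketch, spaCy v2.1+ config keys):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": False})
for label in ("English", "Dog", "Cat", "Squirrel"):
    textcat.add_label(label)
nlp.add_pipe(textcat)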
Excuse me for 'highjacking" this thread but I'm doing something similar.
Before reading this, I was merging my annotated datasets into one dataset and doing the batch training on this.
But I am wondering whether this approach (batch training category A on dataset A, then batch training category B on dataset B on top of the output model from the first run) isn't a better idea.
My thought, for example, is about how the dataset gets split into training/evaluation sets. Does batch train split 80/20 within each label, or only over the entire dataset? The latter could result in some labels not showing up in the evaluation set, and possibly some labels not showing up in training at all!
It seems like this approach of pipelining each label on top of the previous one makes more sense.
Thanks for the reply. There are a couple of follow-ups:
I tried the "separate, chained training sessions" approach, but it didn't seem to work. The second training session just overwrote the classifier from the first.
I realise I wasn't entirely clear originally, but I want to treat the two classifications as completely independent variables. So it's not a hierarchical thing where the results of the first are classified against the second; rather, I just want two totally separate classifications. In fact, I'd happily train two totally separate models, if it didn't mean loading the massive vectors/lexemes data structures into memory multiple times.
Does all the above mean I need to do the "train the components separately and then put them together" approach you mention? And if so, can you point me to any code/materials that might help me implement that?
The output directory should have your two TextCategorizer models. They'll both set values into the doc.cats variable, so if the second one predicts some of the same classes as the first one, its scores will overwrite the previous one. But if they're predicting different labels, you'll get all of the predictions in the dict at the end.
Thanks, that looks like what I want, but I've been having no fun trying to implement it.
I managed to save my "aggregated" model using the code you gave. But when I tried to import it I got:
KeyError: "[E002] Can't find factory for 'textcat2'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['textcat2']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
If I set a breakpoint at that line, do the copy manually, and then run the rest of the function, it seems to build, but the resulting .tar.gz is only 4 KB and, besides a mention in meta.json, doesn't include my model.
The main problem here is that spaCy will look up the component names in the factories and it doesn't know what textcat2 is. So the easiest way to fix this is to do the following:
from spacy.language import Language
Language.factories["textcat2"] = Language.facctories["textcat"]
spaCy v2.2 ships with a new API under the hood that takes care of this automatically: it stores the component name and the factory used for each component in your pipeline. So even if you rename textcat to textcat2, it'll still know that the component was created with the textcat factory.
The packaging error looks like a permissions problem? Maybe the Python process can't write to the target directory?
Thank you! I've managed to get things working nicely now. If it helps others (e.g. @etlweather), I needed to merge/save my individual classifiers like this:
import logging
import shutil
import spacy
from pathlib import Path

def merge_models(name, path_base, paths, path_output):
    # `paths` is a list of pathlib.Path objects, one per single-textcat model
    logging.warning(f'Importing base model "{Path(path_base).stem}"...')
    model = spacy.load(path_base)
    for path in paths:
        logging.warning(f'Merging model "{path.stem}"...')
        # copy model/textcat/* up to model/ so from_disk(path) below can find the files
        for textcat_file in (path / 'textcat').glob('*'):
            shutil.copy(textcat_file, path / textcat_file.name)
        textcat = model.create_pipe("textcat")
        textcat.from_disk(path)
        model.add_pipe(textcat, name=f'textcat_{path.stem}')
    logging.warning(f'Writing aggregated model to "{path_output.stem}"...')
    model.to_disk(path_output)
The important bit is I needed to specifically copy files from model/textcat/* to model/ before it would work. Unsure why.
Then to import:
import json
import logging
from functools import lru_cache
from pathlib import Path

import spacy
from spacy.language import Language

logger = logging.getLogger(__name__)

@lru_cache()
def import_model(path):
    logger.info(f'Importing model from {path}...')
    # register each renamed textcat component with the built-in "textcat"
    # factory so spacy.load() knows how to create it
    model_meta = json.loads((Path(path) / 'meta.json').read_text())
    textcats = [pipe for pipe in model_meta['pipeline'] if pipe.startswith('textcat')]
    for textcat in textcats:
        Language.factories[textcat] = Language.factories["textcat"]
    nlp = spacy.load(path)
    return nlp
Apologies if we (@Mayank and I) are asking conceptual questions here.
At the moment, we are researching ROC AUC for multi-class classification, how to decide thresholds, etc.
Is there a good reference / link / blog on this specific topic that you know of?
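For concreteness, the kind of thing we're looking at is a one-vs-rest AUC per label over the doc.cats scores, roughly like this (scikit-learn, with made-up example numbers):

from sklearn.metrics import roc_auc_score

labels = ["Dog", "Cat", "Squirrel"]
# gold labels and predicted scores per class (made-up data)
y_true = {"Dog": [1, 0, 1, 0], "Cat": [0, 1, 0, 0], "Squirrel": [0, 0, 1, 1]}
y_score = {"Dog": [0.9, 0.2, 0.7, 0.1], "Cat": [0.1, 0.8, 0.3, 0.2], "Squirrel": [0.2, 0.1, 0.6, 0.7]}
for label in labels:
    print(label, roc_auc_score(y_true[label], y_score[label]))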
Hi @honnibal, I have a follow-up question about this. I have a spaCy project where I train two separate textcat models. I'd like to include both in the Python package when I run spacy package; however, there are a few things I'm not sure about.
Is it possible to change the names of the textcat components in the project's config file, e.g. to textcat_x and textcat_y? Would I then also need to explicitly register a Language.factory for each of them? Is it possible to package these models with spacy package, or would I need to write a custom script for packaging them? And if I wanted them to write to doc._.cats_x and doc._.cats_y, would I need to subclass TextCategorizer and treat it like a custom pipeline component?
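For reference, this is roughly what I had in mind (a sketch with the spaCy v3 API; the component names are just examples, and I'm assuming textcat for the exclusive problem and textcat_multilabel for the non-exclusive one, which in the config would correspond to [components.textcat_x] with factory = "textcat", and so on):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("textcat", name="textcat_x")             # exclusive classes
nlp.add_pipe("textcat_multilabel", name="textcat_y")  # non-exclusive classes
print(nlp.pipe_names)  # ['textcat_x', 'textcat_y']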