Best practices & realistic expectations with a high number of classes for a multiclass text classification task


I am trying to convert an existing classifier based on fastText to spaCy, mostly because spaCy is a much easier library to ship and distribute, and it is well integrated with Prodigy (because annotations matter, and need to become part of a regular workflow).

I am witnessing pretty low accuracy so far, and I am wondering if it is because I missed something in the way spacy/prodigy work in tandem. Here is what I have tried so far:

  • took a 60k record "gold" dataset and imported it into a Prodigy Dataset via db-in (and confirmed all the 43 labels are present, and all marked as "accept")
  • automated an "active learner" that connects via the API to the Prodigy textcat.teach recipe in order to generate a healthy number of "reject" answers for the examples that Prodigy selects as having the lowest confidence. I stop this process when I have about a 50/50 split between accepted and rejected samples, which amounts to a dataset of ~120k records
  • run prodigy textcat.batch-train mydataset --n-iter 10 --batch-size 1000 --dropout 0.2 -E (because it is a multi-class problem, not multi-label, the classes are mutually exclusive)

The current results are the following:

Using 20% of examples (24103) for evaluation
Using 100% of remaining examples (96414) for training
Dropout: 0.2  Batch size: 1000  Iterations: 20  

#            LOSS         F-SCORE      ACCURACY  
01           0.000        0.000        0.496                                                      
02           0.000        0.093        0.511                                                      
03           0.000        0.093        0.511                                                      
04           0.000        0.078        0.508                                                      
05           0.000        0.093        0.511                                                      
06           0.000        0.078        0.508                                                      
07           0.000        0.078        0.508                                                      
08           0.000        0.093        0.511                                                      
09           0.000        0.093        0.511                                                      
10           0.000        0.093        0.511  

Now, I have tried with a smaller number of classes (2, 3 and 4) on a subset of the data and it works much better in this case (between 0.85-0.95 accuracy), but that is because the problem is obviously much easier. Is there anything else that other users of Prodigy/spaCy have noticed when dealing with multi-class classification problems where the number of classes is not small (>20)? Should I look into trying a more custom approach in spaCy versus leveraging the built-in textcat.batch-train?

Thank you in advance for your guidance!

It's very possible that we have some unideal settings for large-class problems, because the architectures in Prodigy were mostly optimised for datasets with fewer categories. That said, I'd expect to usually match the performance of the FastText textcat models, because our model architecture should be able to extract the same information they're extracting.

I think the most likely problem here is that the model isn't being set up to predict mutually exclusive classes, which is why you've had to generate those negative examples. If you're training the model with FastText, I'm assuming your data is such that only one label is correct per example. So the important thing is to make sure spaCy is set up with that knowledge.

I would go ahead and use spaCy directly, rather than using Prodigy textcat.batch-train, simply so that you have one less layer of software. It also means you'll be training the model with open-source tooling, which is always going to be preferable to having your automation depend on a proprietary tool (even when the proprietary tool is ours -- I couldn't give the opposite advice with a straight face).

You should be able to use the example script here:

Two things are important:

  1. Make sure you're passing the "exclusive_classes": True setting.
  2. Make sure you're setting up your "cats" dict so that one label is 1.0, and all of the other labels are provided as 0.0. In other words, you need a "dense" format, with no missing values.
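The dense format from point 2 can be sketched as follows. This is a minimal illustration, not code from spaCy or Prodigy: the label list `LABELS` and the helper `make_cats` are hypothetical names, and the three labels stand in for the full label set.

```python
from typing import Dict, List

# Hypothetical label set; in practice this would hold every label in the dataset.
LABELS: List[str] = ["billing", "shipping", "returns"]


def make_cats(gold_label: str, labels: List[str]) -> Dict[str, float]:
    """Return a dense "cats" dict: 1.0 for the gold label, 0.0 for all others."""
    if gold_label not in labels:
        raise ValueError(f"Unknown label: {gold_label!r}")
    return {label: 1.0 if label == gold_label else 0.0 for label in labels}


# One training example in the (text, annotations) shape used by spaCy v2
# training loops. Every label is present, so there are no missing values.
example = ("Where is my parcel?", {"cats": make_cats("shipping", LABELS)})
print(example[1]["cats"])
```

Raising on an unknown label is a deliberate choice here: a silently all-zero dict would violate the "exactly one label is 1.0" assumption that mutually exclusive classes rely on.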

@adriane has been working on the usability of the textcat to make these things easier, and to make sure the process is less error-prone. But you should already be able to see useful results quite quickly.

One outcome of the experiments Adriane has been running is that the "architecture": "bow" setting often performs very well. I would be sure to try that out first, especially in your early experiments while you're trying to get things set up correctly. It will run far faster than the "simple_cnn", which should speed up your process of getting everything correct.
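As a rough sketch of how the two settings fit together, assuming spaCy v2.1 or later: the helper `textcat_config` below is a hypothetical name, and the block only builds the config dict so that it runs without spaCy installed. The commented lines show where the dict would typically be passed.

```python
from typing import Any, Dict

# Architecture names for the spaCy v2 text classifier (assumption: v2.1+).
KNOWN_ARCHITECTURES = ("bow", "simple_cnn", "ensemble")


def textcat_config(architecture: str = "bow") -> Dict[str, Any]:
    """Build a textcat config for mutually exclusive classes."""
    if architecture not in KNOWN_ARCHITECTURES:
        raise ValueError(f"Unknown architecture: {architecture!r}")
    return {"exclusive_classes": True, "architecture": architecture}


config = textcat_config("bow")
print(config)

# With spaCy v2 installed, the dict would be used along these lines:
#   textcat = nlp.create_pipe("textcat", config=config)
#   nlp.add_pipe(textcat, last=True)
#   for label in labels:
#       textcat.add_label(label)
```

Keeping the config in one place makes it easy to swap "bow" for "simple_cnn" once the data pipeline is known to be correct.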

Hi @honnibal, thank you so much for your detailed response! You were right on target when you said that the model probably wasn't being set up to predict mutually exclusive classes.

From what I experienced, annotating a dataset in Prodigy with the textcat.teach recipe appears to put the dataset in a mode where the "exclusive" param is ignored. I did find a reference in the Prodigy code (, line 203) that seems to confirm this behavior:

    # Make sure that examples in datasets created with a choice interface are
    # converted to "regular" text classification tasks with a "label" key
    examples = convert_options_to_cats(examples, exclusive=exclusive)

I think a warning would be nice if there is an underlying reason to not format the examples with the explicit 0.0 for all the other labels in "cats" when the dataset is created using textcat.teach vs textcat.manual.

Once I fixed this though the results were much better and comparable to FastText, so thank you for suggesting this!
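One possible way to make such a fix is sketched below; it is not necessarily what was done here. It assumes binary records with "text", "label" and "answer" keys, as produced by accept/reject annotation, and the helper name `accepted_to_dense` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, Dict[str, Dict[str, float]]]


def accepted_to_dense(records: Iterable[dict], labels: List[str]) -> List[Example]:
    """Turn accepted binary records into dense, mutually exclusive examples."""
    examples: List[Example] = []
    for record in records:
        # A "reject" only says that one label is wrong, so it cannot be
        # expanded into a full one-hot vector and is skipped here.
        if record.get("answer") != "accept" or record.get("label") not in labels:
            continue
        cats = {label: float(label == record["label"]) for label in labels}
        examples.append((record["text"], {"cats": cats}))
    return examples


# Hypothetical records, for illustration only.
records = [
    {"text": "Refund please", "label": "returns", "answer": "accept"},
    {"text": "Refund please", "label": "billing", "answer": "reject"},
]
print(accepted_to_dense(records, ["billing", "shipping", "returns"]))
```

In this sketch the rejected records are simply dropped, which means the generated negatives no longer contribute to training once the classes are treated as exclusive.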

I also decided to switch to using the spaCy example script as a template, and I do like it better. It does not add a dependency on Prodigy at model training time and is a lot more customizable. The best thing about doing it the verbose way like this is the ability to access the metrics after each epoch (I am integrating this with MLflow, so I can easily log all of this for each experiment to track the best possible params).
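A minimal sketch of that per-epoch logging, under stated assumptions: `exclusive_accuracy` and `log_epoch` are hypothetical helpers, the scores are made up, and the logger is passed in as a callable so the block runs without MLflow. With MLflow installed, `mlflow.log_metric` (key, value, step) could be passed as `log_metric`.

```python
from typing import Callable, Dict, List, Tuple


def exclusive_accuracy(predicted: List[Dict[str, float]], gold: List[str]) -> float:
    """Fraction of examples whose highest-scoring label equals the gold label."""
    if not gold:
        return 0.0
    correct = sum(
        1 for scores, label in zip(predicted, gold)
        if max(scores, key=scores.get) == label
    )
    return correct / len(gold)


def log_epoch(epoch: int, metrics: Dict[str, float],
              log_metric: Callable[[str, float, int], None]) -> None:
    """Send each metric for one epoch to the supplied logger."""
    for name, value in metrics.items():
        log_metric(name, value, epoch)


# Stand-in logger that collects metrics in a list instead of calling MLflow.
logged: List[Tuple[str, float, int]] = []

# Made-up scores for two examples, as doc.cats would hold after prediction.
predicted = [{"a": 0.7, "b": 0.3}, {"a": 0.4, "b": 0.6}]
gold_labels = ["a", "a"]

accuracy = exclusive_accuracy(predicted, gold_labels)
log_epoch(1, {"accuracy": accuracy}, lambda k, v, s: logged.append((k, v, s)))
print(logged)
```

Taking the argmax over the scores matches the mutually exclusive setup, where exactly one label is expected per example.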

I also tried your suggestion and switched to "architecture": "bow". It did converge faster than the "simple_cnn", but the difference in accuracy is negligible overall for my data.

One last quick question: is there a way in spaCy to specify a tokenizer that works at the character level instead of the word level? I have an upcoming problem that might be better addressed by looking at character sequences rather than word sequences. In that spirit, is this something that Prodigy can support as well in the active learning (textcat.teach) mode?

Thanks again for all your help!