Classification with unknown number of classes

I'm on a compressed timeline with a short-text classification use case involving 26k described course prerequisites (e.g. "Knowledge of MS Office, HS Diploma/GED" vs. "MS Office"). Each described course prerequisite is a freeform short text field that can be split into mostly discrete prereqs by a fairly simple rule matcher, which looks for what non-prerequisites look like (e.g. CCONJ tokens, '/', ',', etc.) and uses those as boundaries to create spans that yield the actual prerequisites. The end result is a list of prerequisites in unstandardized language that is 2x-10x the size of the original data.
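
For reference, here's a minimal sketch of the kind of splitter I mean, assuming spaCy's `en_core_web_sm` model and treating CCONJ tokens plus a small set of separator punctuation as span boundaries (the separator set and example output are just illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
SEPARATORS = {",", "/", ";"}  # punctuation treated as prereq boundaries

def split_prereqs(text):
    """Split a freeform prerequisite string on CCONJ tokens and separator punctuation."""
    doc = nlp(text)
    spans, start = [], 0
    for token in doc:
        if token.pos_ == "CCONJ" or token.text in SEPARATORS:
            if token.i > start:
                spans.append(doc[start:token.i].text.strip())
            start = token.i + 1
    if start < len(doc):
        spans.append(doc[start:].text.strip())
    return [s for s in spans if s]

print(split_prereqs("Knowledge of MS Office, HS Diploma/GED"))
# e.g. ['Knowledge of MS Office', 'HS Diploma', 'GED']
```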

Since the data is freeform, I have three issues I'm trying to address simultaneously and would love any best practices or insights:

[Q1] Can Prodigy help with annotation when the number of classes is unknown, and if so, how best?

[Q2] With unknown classes, can Prodigy still iteratively re-label as more annotations become available?

[Q3] Can Prodigy relabel/rename a class in the middle of an annotation run?

I doubt there are more than 1000 possible prerequisites, and likely fewer, perhaps 500-700.

The above three questions relate to an off-the-cuff approach I'm considering:

  • Assume, say, there are 500 classes and label them 1 through 500 a priori.
  • Iteratively assign a class to each new prerequisite as I see it. Here the number of classes is fixed, as in an active learning problem, and the semantic meaning of each class is decided by the human at run time.
  • When a class is newly assigned, give it a new label (e.g. 4 --> MS Office) for better class-name tracking and annotation efficiency (see the sketch after this list).
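
To illustrate that last bullet, a minimal sketch of renaming placeholder labels in an exported annotation file (e.g. after `prodigy db-out`). The JSONL schema with an `accept` list per task, the mapping, and the file names are assumptions on my part, not a documented Prodigy workflow:

```python
import json

# Mapping from placeholder class IDs to the names decided during annotation
# (assumed to be maintained by hand as new classes are encountered).
LABEL_MAP = {"4": "MS Office", "17": "HS Diploma/GED"}

def rename_labels(in_path="annotations.jsonl", out_path="annotations_renamed.jsonl"):
    """Rewrite placeholder labels in an exported annotation file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            task = json.loads(line)
            # "accept" holds the selected option(s) in a choice-style task;
            # adjust the key if your export uses a different schema.
            task["accept"] = [LABEL_MAP.get(lbl, lbl) for lbl in task.get("accept", [])]
            fout.write(json.dumps(task) + "\n")

rename_labels()
```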

What might be a better approach? What might raise an issue here?

I think you'd probably be better off looking into topic-modelling techniques like LDA to cluster your data before you do anything. The Gensim package has a good suite of utilities for this.
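
As a rough sketch of what that could look like with Gensim (the tokenization, topic count, and variable names are placeholder assumptions you'd tune for your data):

```python
from gensim import corpora, models
from gensim.utils import simple_preprocess

# prereq_texts: the ~26k (or expanded) freeform prerequisite strings
prereq_texts = ["Knowledge of MS Office", "HS Diploma or GED", "MS Office"]

tokenized = [simple_preprocess(t) for t in prereq_texts]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(toks) for toks in tokenized]

# Topic count is a guess: you estimate 500-700 classes, but you might
# start coarser and refine after inspecting the topics.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100,
                      passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_topics=10, num_words=5):
    print(topic_id, words)
```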

Once you've got unsupervised clusters, you could label the topics rather than the documents, and then go through the texts assigned a high probability of belonging to a topic and mark the ones which don't in fact belong. You could use this to bootstrap a training set, or you might find the topic models are sufficiently accurate on their own.
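
Continuing the sketch above (reusing `lda`, `corpus`, and `prereq_texts`), pulling out the texts the model assigns to a topic with high probability for review could look something like this; the 0.8 threshold is arbitrary:

```python
from collections import defaultdict

# Collect texts per topic where the model is fairly confident (threshold assumed).
by_topic = defaultdict(list)
for text, bow in zip(prereq_texts, corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        if prob >= 0.8:
            by_topic[topic_id].append((prob, text))

# Review candidates topic by topic, most confident first.
for topic_id, items in sorted(by_topic.items()):
    print(f"--- topic {topic_id} ---")
    for prob, text in sorted(items, reverse=True):
        print(f"{prob:.2f}  {text}")
```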

Especially on a tight timeframe, I doubt you'll do much better than an LDA with another process. What you're proposing, iteratively developing your set of classes, is actually quite similar to what the LDA will do automatically, but I think the LDA will do a better job.
