I'm on a compressed timeline with a short text classification use case involving 26k described course prerequisites (e.g. "Knowledge of MS Office, HS Diploma/GED" vs "MS Office"). Each described course prerequisite is a freeform short text field that can be split into mostly discrete prereqs by a fairly simple rule matcher. The matcher focuses on what non-prerequisites look like (e.g. CCONJ, '/', ',', etc.) and creates spans that yield the actual prerequisites. The end result is a list of prerequisites 2x-10x the size of the original, in unstandardized language.
Since the data is freeform, I have three issues I'm trying to address simultaneously and would love any best practices or insights:
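A rough sketch of the splitting step, for illustration only. It swaps the POS-based CCONJ rule for a plain regex over a few separators, and both the separator list and the `split_prerequisites` name are assumptions, not the actual matcher:

```python
import re

# Separators that mark a boundary between prerequisites. The word
# alternatives ("and", "or") are a regex stand-in for the CCONJ rule;
# this list is an assumption and would need tuning on real data.
SEPARATORS = re.compile(r"\s*(?:,|/|;|\band\b|\bor\b)\s*", flags=re.IGNORECASE)


def split_prerequisites(text):
    """Split a freeform prerequisite field into candidate prerequisite spans."""
    parts = SEPARATORS.split(text)
    # Drop empty strings left behind by leading/trailing separators.
    return [part.strip() for part in parts if part.strip()]


print(split_prerequisites("Knowledge of MS Office, HS Diploma/GED"))
# ['Knowledge of MS Office', 'HS Diploma', 'GED']
```

Note that the spans are still unstandardized: "Knowledge of MS Office" and "MS Office" come out as different strings, which is what the classification step has to resolve.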
[Q1] Can Prodigy help with annotation when the number of classes is unknown, and if so, how best?
[Q2] With an unknown number of classes, can Prodigy still iteratively re-label as more annotations become available?
[Q3] Can Prodigy relabel/rename a class in the middle of an annotation run?
I doubt there are more than 1000 possible prerequisites, and likely fewer, around 500-700.
The above 3 questions relate to an off-the-cuff approach I'm considering:
- Assume, say, there are 500 classes. Label them 1 through 500 a priori
- Iteratively assign a class to a new prerequisite when I see it. Here the number of classes is fixed for the active learning problem, and the semantic meaning of each class is decided by the human at run time.
- When a class is newly assigned, give it a new label, e.g. 4 --> MS Office, for better class name tracking and annotation efficiency
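The steps above can be sketched as a small bookkeeping structure. This is a minimal sketch in plain Python and does not use any Prodigy API; `LabelRegistry`, `assign`, and `rename` are hypothetical names chosen for illustration:

```python
class LabelRegistry:
    """Fixed pool of placeholder class IDs that get a readable name on first use."""

    def __init__(self, n_classes=500):
        # Map each placeholder ID (1..n_classes) to its name, or None if unused.
        self.names = {class_id: None for class_id in range(1, n_classes + 1)}

    def assign(self, name):
        """Return the ID already holding `name`, or claim the lowest unused ID."""
        for class_id, existing in self.names.items():
            if existing == name:
                return class_id
        for class_id, existing in self.names.items():
            if existing is None:
                self.names[class_id] = name
                return class_id
        raise RuntimeError("All placeholder classes are used up")

    def rename(self, class_id, new_name):
        """Change the display name of a class without touching its ID."""
        self.names[class_id] = new_name


registry = LabelRegistry(n_classes=500)
ms_office = registry.assign("MS Office")   # claims ID 1
ged = registry.assign("HS Diploma/GED")    # claims ID 2
registry.rename(ms_office, "Microsoft Office")
```

If annotations store the numeric ID rather than the name, a rename only changes the lookup table and existing annotations stay valid. One thing this makes visible is the failure mode when the assumed class count is too low: `assign` has nowhere to put a new prerequisite once all IDs are claimed.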
What might be a better approach? What issues might this one raise?