I've been using Prodigy and spaCy very successfully for various NER tasks for some time.
I'm now trying to train a multi-label text classification model for news articles.
I already have pre-annotated data that contains news headlines and the labels applicable to each example.
What I cannot find anywhere is the format of the input JSONL file for multi-label text classification. I can find examples of single-label binary classifiers, such as the INSULTS dataset in the tutorial, where a "label" key is provided along with the text. But for multiple labels, am I supposed to (a) provide a list of labels under this key, (b) repeat each example for every label that applies to it, or (c) provide an "accept" key listing all applicable labels, similar to what the textcat.manual recipe does?
The documentation is lacking on this subject. In the Text Classification docs, under the section "I already have annotations and just want to train a model", it says that we need to supply a "text" key along with a "spans" key. Surely this is for NER model training and not for text classification, right?
Hi! Prodigy's built-in train recipe accepts data created with the binary classification interface (one text, a label and the accept/reject answer), as well as multiple-choice selections created with the choice interface (one text and a list of accepted labels as the "accept" key).
In general, the data format always depends on the annotation interface you want to use. For each interface, the docs should include pretty detailed examples of the expected data formats and config settings, and previews of how the data is going to be shown in the app. For multiple-choice annotation, the interface used is the choice interface. You can see examples of the data format and configuration options in the "Annotation interfaces" section of the Prodigy docs.
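As a rough illustration of the two shapes described above, here is a minimal sketch that builds one task of each kind and prints it as a JSONL line. The headline and the label names are made up, and the "options" entries are assumed to use the "id"/"text" layout shown for the choice interface; treat the interface docs as the reference.

```python
import json

# Binary classification task: one text, one label, and the accept/reject answer.
# The headline and label names are invented for illustration.
binary_task = {
    "text": "Senate passes budget as markets rally",
    "label": "POLITICS",
    "answer": "accept",
}

# Multiple-choice task: one text, the full list of options, and the IDs of the
# selected options under "accept".
choice_task = {
    "text": "Senate passes budget as markets rally",
    "options": [
        {"id": "BUSINESS", "text": "Business"},
        {"id": "POLITICS", "text": "Politics"},
        {"id": "SPORTS", "text": "Sports"},
    ],
    "accept": ["BUSINESS", "POLITICS"],
    "answer": "accept",
}

# JSONL means one JSON object per line.
for task in (binary_task, choice_task):
    print(json.dumps(task))
```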
Ah, thanks, that's a typo! That should probably just say "label" or "options" and then link off to the respective annotation interfaces for data examples.
Thanks! Just so I am absolutely clear: does this mean that I should create my dataset so that a view ID is specified in each example? Which interface will be picked up by default if no view ID is specified for text classification? Basically, what I want to do can be summarized as follows:
Use db-in to import the seed data (gold standard data)
Run train textcat to get an initial model
Run textcat.teach to improve the model using active learning
What format should I provide when importing the dataset so that the training command knows what to do? My understanding, based on your answer, is that I can either (1) set "_view_id" to "classification" on every example, along with the text and label, and repeat each example once for every label that applies to it, or (2) set "_view_id" to "choice" on every example, along with the text and an "accept" key containing a list of labels (and probably "options" containing a list of all possible labels). Either way should be fine. Do I have this right?
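To make the two options concrete, this is roughly how I would convert my pre-annotated headlines before running db-in. It is only a sketch of my understanding: the helper names and the label set are hypothetical, and whether "_view_id" is needed at all is part of what I am asking.

```python
import json

# Hypothetical label set; the real one would come from my annotation scheme.
ALL_LABELS = ["BUSINESS", "POLITICS", "SPORTS"]


def to_binary_tasks(text, labels):
    """Option 1: repeat the example once per applicable label."""
    return [
        {"text": text, "label": label, "answer": "accept",
         "_view_id": "classification"}  # assumption: may not be required
        for label in labels
    ]


def to_choice_task(text, labels, all_labels=ALL_LABELS):
    """Option 2: one example with all options and the accepted label IDs."""
    return {
        "text": text,
        "options": [{"id": label, "text": label} for label in all_labels],
        "accept": [label for label in all_labels if label in labels],
        "answer": "accept",
        "_view_id": "choice",  # assumption: may not be required
    }


def to_jsonl(tasks):
    """Serialize a list of task dicts as JSONL, one object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


headline = "Senate passes budget as markets rally"
gold_labels = ["POLITICS", "BUSINESS"]

binary_jsonl = to_jsonl(to_binary_tasks(headline, gold_labels))
choice_jsonl = to_jsonl([to_choice_task(headline, gold_labels)])
```

Either string could then be written to a .jsonl file and imported with db-in.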