Multilabel text classification

Lingo · January 28, 2022, 6:48am

Hi,
I am new to Prodigy and am trying to load a dataset with multiple, mutually exclusive labels. The JSON file, with two examples, looks like this:
[
{"text": "This is about robot science", "label": "TECHNOLOGY"},
{"text": "This is about money","label": "ECONOMY"}
]

The data is imported as follows:
prodigy db-in data_minimal data.json

Then I start Prodigy with the following command:
prodigy textcat.manual data_minimal data.json --label TECHNOLOGY,ECONOMY

But I only get one label that consists of both labels merged.

What I really want to do is train a BERT model for email categorization and then use Prodigy to manually review emails that are wrongly classified.
Is there a basic tutorial apart from the support site? I couldn't find anything really basic on YouTube about how to get started.
Thanks
Anders

ines · January 28, 2022, 10:09am

Hi! Are you using Windows PowerShell by any chance? It seems to have this quirk that requires you to wrap strings explicitly in quotes, so try --label "TECHNOLOGY,ECONOMY" instead.

Btw, you shouldn't have to export anything into your dataset before you start annotating – the data will be read in automatically from the file when you start the server. The dataset is intended for the collected annotations, not the raw data. If you import raw data, you end up with unannotated examples in your dataset, which is typically not what you want.
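
If the server is started from a script rather than typed into a shell, the quoting question can be sidestepped by passing the arguments as a list. This is only a minimal sketch: `build_textcat_command` is a hypothetical helper, not part of Prodigy, and it assumes the `prodigy` executable is on your PATH.

```python
def build_textcat_command(dataset, source, labels):
    """Build the argument list for `prodigy textcat.manual`.

    Hypothetical helper: the labels are joined into ONE argument, so no
    shell gets a chance to reinterpret the comma.
    """
    return ["prodigy", "textcat.manual", dataset, source, "--label", ",".join(labels)]


if __name__ == "__main__":
    cmd = build_textcat_command("data_minimal", "data.json", ["TECHNOLOGY", "ECONOMY"])
    print(cmd)
    # To actually start the server (assumes Prodigy is installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

The list form hands each element to the program as-is, which is the same effect the quotes have in PowerShell.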

Lingo · January 28, 2022, 12:34pm

Thanks a lot for the rapid and concise answer! Using quotes solved my first problem. If I understand you correctly, you recommend that I start the annotation session with the following command:

prodigy textcat.manual - data.json --label "TECHNOLOGY,ECONOMY"

Will Prodigy still save the annotated data in "data_minimal" then? Or where else will it be stored?
Anders

nix411 · January 28, 2022, 1:18pm

I can answer that. It'll save the annotations directly to your database, into the dataset you name in the command. See details here. So it would look something like this:

prodigy textcat.manual email-labels ./data.json --label "TECHNOLOGY,ECONOMY"

assuming you have data.json looking something like this

[{"text": "some text"}, {"text": "some other text"}]

The above command would save your annotations into a dataset called email-labels
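
If your file still carries a "label" key like the one in the first post and you want it to match the text-only shape above, something like this would do it. It's just a sketch: `strip_to_text` is a made-up helper name, and whether Prodigy actually needs the key removed isn't something I've verified.

```python
import json


def strip_to_text(records):
    """Keep only the "text" field of each record (hypothetical helper)."""
    return [{"text": record["text"]} for record in records]


if __name__ == "__main__":
    # The two examples from the first post.
    records = [
        {"text": "This is about robot science", "label": "TECHNOLOGY"},
        {"text": "This is about money", "label": "ECONOMY"},
    ]
    # Print the text-only version; redirect it to a file to use it as the source.
    print(json.dumps(strip_to_text(records)))
```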

Lingo · January 28, 2022, 2:04pm

Thanks a lot! It takes some time to get used to the syntax here, but this got me a lot further.
