Importing existing text classification data with binary labels

jan · November 27, 2017, 10:56pm

I “initialize” a Prodigy dataset using db-in. I have 2 CSV files: 283 positive examples and 461 negative examples. Both files have a “text” and a “label” column (the label column has the value “1” in the positive examples and “0” in the negative ones). The texts are arbitrary sentences, and the labels indicate the presence of a certain type of claim in that sentence.

When calling db-in I use the settings --answer accept and --answer reject respectively.

When I then run batch-train on that dataset, it shows that it's using 560 examples for training.

However, I get this output:

      LOSS     F-SCORE   ACCURACY
01    36.179   1.000     1.000
02    36.881   1.000     1.000
03    36.879   1.000     1.000
04    36.972   1.000     1.000
05    36.988   1.000     1.000
06    36.989   1.000     1.000
07    36.992   1.000     1.000
08    36.986   1.000     1.000
09    36.962   1.000     1.000
10    36.966   1.000     1.000

MODEL USER COUNT
accept accept 48
accept reject 0
reject reject 91
reject accept 0

Correct 139
Incorrect 0

Baseline 0.65
Precision 1.00
Recall 1.00
F-score 1.00
Accuracy 1.00

I use a similar dataset for “textcat.teach”, this time without db-in: I label a few (~30) sentences through the UI, then run batch-train. Now the results look valid:

      LOSS     F-SCORE   ACCURACY
01    2.926    0.444     0.545
02    3.308    0.600     0.636
03    3.877    0.667     0.727
04    3.604    0.750     0.818
05    3.270    0.750     0.818
06    2.366    0.750     0.818
07    2.777    0.750     0.818
08    2.676    0.750     0.818
09    2.603    0.750     0.818
10    2.430    0.750     0.818

MODEL USER COUNT
accept accept 3
accept reject 2
reject reject 6
reject accept 0

Correct 9
Incorrect 2

Baseline 0.73
Precision 0.60
Recall 1.00
F-score 0.75
Accuracy 0.82
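
For reference, the summary figures can be reproduced from the MODEL / USER counts. Below is a minimal Python sketch, assuming "accept" is the positive class and that the baseline is the majority-class rate. That baseline definition is an assumption that happens to match both tables above; it is not confirmed from Prodigy itself. The function name `summarize` is hypothetical.

```python
def summarize(tp, fp, tn, fn):
    """Compute summary metrics from MODEL/USER counts.

    tp: model accept, user accept    fp: model accept, user reject
    tn: model reject, user reject    fn: model reject, user accept
    """
    total = tp + fp + tn + fn
    user_accept = tp + fn
    user_reject = tn + fp
    # Assumption: baseline = accuracy of always guessing the majority answer.
    baseline = max(user_accept, user_reject) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / total
    return {
        "baseline": round(baseline, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f_score": round(f_score, 2),
        "accuracy": round(accuracy, 2),
    }


# Counts from the db-in run (first table) and the textcat.teach run (second).
print(summarize(tp=48, fp=0, tn=91, fn=0))
print(summarize(tp=3, fp=2, tn=6, fn=0))
```

With the first set of counts nothing is misclassified, so every metric except the baseline comes out as 1.00, which is the suspicious part of the db-in run.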

One difference I notice is that the records created by db-in have no “score” and no “meta” field. The “answer” looks good on each of them, and the “label” is “0” or “1” (not sure that matters; it's not clear to me what the “label” is used for).

Btw, I am using prodigy-0.5.0.

Never mind, I figured it out. It was not obvious to me that the label is actually the thing that gets “accepted” or “rejected”; I thought it was some general dataset-wide accept / reject …
Working now.

ines · November 28, 2017, 1:34am

Yes, this is correct. Thanks for updating with your solution.

I'll add a note about this in the db-in section of the docs – maybe even a little infobox or section with some general tips for importing existing datasets. Prodigy's accept/reject concept is a little different from how most other annotation data is structured, so this will probably be relevant to other users as well.
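
To illustrate the point for data like the CSV files described above, here is a rough Python sketch that turns “text” / “label” rows into records that all share one label, with the old 0/1 column mapped to the answer instead. The label name `CLAIM`, the helper names, and the sample sentences are made up for illustration; the record keys (“text”, “label”, “answer”) follow the fields mentioned in this thread.

```python
import csv
import io
import json

LABEL = "CLAIM"  # hypothetical label name; use whatever your category is called


def rows_to_tasks(rows, label=LABEL):
    """Map rows with a 0/1 'label' column to accept/reject records.

    Every record gets the SAME label; the 0/1 value only decides
    whether that label was accepted or rejected.
    """
    tasks = []
    for row in rows:
        answer = "accept" if row["label"].strip() == "1" else "reject"
        tasks.append({"text": row["text"], "label": label, "answer": answer})
    return tasks


def csv_to_jsonl(csv_text, label=LABEL):
    """Convert CSV text with 'text' and 'label' columns to JSONL lines."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(task) for task in rows_to_tasks(rows, label))


# Made-up sample rows, standing in for the positive and negative CSV files.
sample = "text,label\nVitamin C cures colds.,1\nIt rained on Tuesday.,0\n"
print(csv_to_jsonl(sample))
```

The important part is that both the positive and the negative examples carry the same label and differ only in the answer. If the positives are labelled “1” and the negatives “0”, they are treated as two separate labels rather than as accept and reject for one label.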
