Importing existing text classification data with binary labels

I “initialized” a Prodigy dataset using db-in. I have two CSV files: 283 positive examples and 461 negative examples. Both files have a “text” and a “label” column (the label column has the value “1” for the positive examples and “0” for the negative ones). The texts are arbitrary sentences, and the labels indicate the presence of a certain type of claim in that sentence.

When calling db-in, I use the settings --answer accept and --answer reject, respectively.

When I then run batch-train on that dataset, it shows that it's using 560 examples for training.

However, I get this output:

      LOSS      F-SCORE   ACCURACY

01    36.179    1.000     1.000
02    36.881    1.000     1.000
03    36.879    1.000     1.000
04    36.972    1.000     1.000
05    36.988    1.000     1.000
06    36.989    1.000     1.000
07    36.992    1.000     1.000
08    36.986    1.000     1.000
09    36.962    1.000     1.000
10    36.966    1.000     1.000

MODEL USER COUNT
accept accept 48
accept reject 0
reject reject 91
reject accept 0

Correct 139
Incorrect 0

Baseline 0.65
Precision 1.00
Recall 1.00
F-score 1.00
Accuracy 1.00

I used a similar dataset with “textcat.teach”, this time without db-in: I labeled a few (~30) sentences through the UI and then ran batch-train. Now the results look valid:

      LOSS      F-SCORE   ACCURACY

01    2.926     0.444     0.545
02    3.308     0.600     0.636
03    3.877     0.667     0.727
04    3.604     0.750     0.818
05    3.270     0.750     0.818
06    2.366     0.750     0.818
07    2.777     0.750     0.818
08    2.676     0.750     0.818
09    2.603     0.750     0.818
10    2.430     0.750     0.818

MODEL USER COUNT
accept accept 3
accept reject 2
reject reject 6
reject accept 0

Correct 9
Incorrect 2

Baseline 0.73
Precision 0.60
Recall 1.00
F-score 0.75
Accuracy 0.82
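For anyone reading these tables later: the summary figures can be recomputed from the MODEL/USER counts. The sketch below is my own arithmetic, not Prodigy's code. It assumes "accept" is the positive class and that Baseline is the share of the more common user answer; that assumption reproduces the numbers in both runs, but I have not checked it against the source. The function name `summarize` is made up for illustration.

```python
def summarize(tp, fp, tn, fn):
    """Recompute summary metrics from the MODEL/USER counts.

    tp: model accept, user accept    fp: model accept, user reject
    tn: model reject, user reject    fn: model reject, user accept
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total
    # Assumption: baseline = always predict the more common user answer.
    baseline = max(tp + fn, tn + fp) / total
    return {
        "baseline": baseline,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Counts from the textcat.teach run above.
for name, value in summarize(tp=3, fp=2, tn=6, fn=0).items():
    print(f"{name:<10} {value:.2f}")
# baseline   0.73
# precision  0.60
# recall     1.00
# f_score    0.75
# accuracy   0.82
```

Plugging in the counts from the db-in run (48, 0, 91, 0) gives a baseline of 0.65 and 1.00 everywhere else, matching the first table.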

One difference I notice is that the records created by db-in have no “score” and no “meta” field. The “answer” looks good on each of them, and the “label” is “0” or “1” (not sure whether that matters; it's not clear to me what the “label” is used for).

By the way, I am using prodigy-0.5.0.

Never mind, I figured it out. It was not obvious to me that the label is actually the thing that gets “accepted” or “rejected”; I thought it was some general, dataset-wide accept/reject.
Working now.
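In case it helps someone else: with my original import, every “1” record was an accept and every “0” record was a reject, so the label alone gave away the answer, which is presumably why the scores were a perfect 1.00. Here is a rough sketch of converting such a CSV so that all records share one label and only the answer differs. The label name CLAIM is a placeholder I made up; the “text”, “label” and “answer” fields are the ones I see on the records in my dataset. I am assuming db-in accepts a JSONL file like this in your version, so check the docs before relying on it.

```python
import csv
import io
import json

LABEL = "CLAIM"  # hypothetical label name; use whatever your category is called


def rows_to_tasks(rows, label=LABEL):
    """Map CSV rows with a 0/1 'label' column to tasks sharing one label."""
    for row in rows:
        value = row["label"].strip()
        if value == "1":
            answer = "accept"   # the sentence contains the claim
        elif value == "0":
            answer = "reject"   # the sentence does not contain the claim
        else:
            raise ValueError(f"unexpected label value: {value!r}")
        yield {"text": row["text"], "label": label, "answer": answer}


def convert(csv_file, jsonl_file, label=LABEL):
    """Read CSV from one file object, write one JSON task per line to another."""
    count = 0
    for task in rows_to_tasks(csv.DictReader(csv_file), label):
        jsonl_file.write(json.dumps(task) + "\n")
        count += 1
    return count


# Small in-memory demo; with real files, pass open(...) handles instead.
source = io.StringIO("text,label\nThis has a claim.,1\nThis does not.,0\n")
target = io.StringIO()
convert(source, target)
print(target.getvalue(), end="")
```

Because the answer is already stored on each record, positive and negative examples can go into a single file this way instead of being imported separately.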

Yes, this is correct 👍 Thanks for updating with your solution.

I’ll add a note about this to the db-in section of the docs, maybe even a little infobox or section with some general tips for importing existing datasets. Prodigy’s accept/reject concept is a little different from how most other annotation data is structured, so this will probably be relevant to other users as well.