Automating the annotation for textcat.teach based on score

Hi, I have 20k samples with 25 different labels. If I want to use Prodigy and call textcat.teach, it will open the web app, in which I have to go through all 20k samples to annotate them before I can train.

Is there any way to automate this process? Sorry if the query is silly.

Just to make sure I understand your question correctly: The 20k examples you have are already labelled, so you want to skip the annotation part?

In this case, you can simply use the prodigy db-in command to import your annotations and add them to a new dataset. All you need to do is convert your data to a format Prodigy can read in – the most convenient would be JSON or JSONL (newline-delimited JSON, which can be read in line by line):

{"text": "Some text", "label": "LABEL"}
{"text": "Some other text", "label": "OTHER_LABEL"}

You can then create a new dataset and import the data:

prodigy dataset my_dataset "Description of my dataset"
prodigy db-in my_dataset /path/to/data.jsonl

If no "answer" key is present on the examples you’re importing, Prodigy will automatically set them all to "answer": "accept" – i.e. import them as correct examples.

Thank you

I have done the same, but the accuracy results are very bad on the data I imported. On the same data, a simple linear SVM gets around 63% accuracy, but with Prodigy's training the results are not nearly as good.

Note: I am using the German news model, since I am dealing with German data, and the data is very imbalanced.

Loaded model de_dep_news_sm
Using 20% of examples (819) for evaluation
Using 100% of remaining examples (3278) for training
Dropout: 0.2  Batch size: 10  Iterations: 5

#    LOSS        F-SCORE    ACCURACY
01   4728.025    0.999      0.999
02   5579.310    0.999      0.999
03   6250.201    0.999      0.999
04   6411.890    0.999      0.999
05   6350.550    0.999      0.999

accept accept 818
accept reject 0
reject reject 0
reject accept 1

Correct 818
Incorrect 1

Baseline 1.00
Precision 1.00
Recall 1.00
F-score 1.00
Accuracy 1.00


Well, according to the results you’ve posted, Prodigy thinks the accuracy is 1.0, i.e. 100% – which is obviously suspicious. The reason is that you’re currently only training on "accept" examples, i.e. correct ones. So your model has simply learned that “everything is correct”, which yields 100% accuracy but is pretty useless overall.

The solution is to add a bunch of “wrong” examples – ideally, the same amount as “correct” examples, so you have a nice 50/50 split. You can also do this programmatically by simply swapping out the labels in your existing dataset. (In the long run, you should probably also consider creating wrong examples using different texts.)
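Swapping out the labels could look like the sketch below. Note the random re-labelling strategy is an assumption about how you want to generate the “wrong” examples, and `LABELS` stands in for your full set of 25 labels:

```python
import json
import random

random.seed(0)  # reproducible label swaps
LABELS = ["LABEL", "OTHER_LABEL"]  # in practice, your full set of 25 labels

examples = [
    {"text": "Some text", "label": "LABEL"},
    {"text": "Some other text", "label": "OTHER_LABEL"},
]

def make_reject(example):
    # Pick any label other than the correct one to create a "wrong" example.
    wrong = random.choice([l for l in LABELS if l != example["label"]])
    return {"text": example["text"], "label": wrong}

# Write the deliberately mislabelled copies to a separate JSONL file,
# ready to be imported with --answer reject.
with open("wrong_data.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(make_reject(eg), ensure_ascii=False) + "\n")
```

Each output line keeps the original text but carries a label that is guaranteed to be incorrect, which is exactly what you want for the "reject" half of the dataset.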

Then you add those examples to your Prodigy dataset, and set the answer to "reject":

prodigy db-in my_dataset /path/to/wrong_data.jsonl --answer reject

You should now have 40k annotations in your set – 20k correct and 20k incorrect. Now running batch-train should give you a more realistic accuracy score, and hopefully beat your previous 63% :grinning:

Thank you, I will try :slight_smile: Hopefully I will beat the 63% :wink:
