Hi, I have 20k samples with 25 different labels. If I want to use Prodigy and call textcat.teach, it gives me a web app in which I have to go through all 20k samples to annotate them, and only then can I train.
Is there any way to automate this process? Sorry if the query is silly.
Just to make sure I understand your question correctly: The 20k examples you have are already labelled, so you want to skip the annotation part?
In this case, you can simply use the prodigy db-in command to import your annotations and add them to a new dataset. All you need to do is convert your data to a format Prodigy can read in; the most convenient would be JSON or JSONL (newline-delimited JSON, which can be read in line by line):
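For example, here's a minimal sketch of producing such a JSONL file. The file name and example texts are placeholders, and it assumes your labelled data is available as (text, label) pairs; "text" and "label" are the keys Prodigy's text classification recipes expect:

```python
import json

# Minimal sketch: assumes your labelled data is available as (text, label) pairs.
# File name and example texts are placeholders - adjust them to your own data.
examples = [
    ("Erster Beispieltext über Fußball ...", "SPORT"),
    ("Zweiter Beispieltext über die Wahl ...", "POLITIK"),
]

with open("data.jsonl", "w", encoding="utf8") as f:
    for text, label in examples:
        # one JSON object per line, with "text" and "label" keys
        f.write(json.dumps({"text": text, "label": label}, ensure_ascii=False) + "\n")
```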
You can then create a new dataset and import the data:
prodigy dataset my_dataset "Description of my dataset"
prodigy db-in my_dataset /path/to/data.jsonl
If no "answer" key is present on the examples youβre importing, Prodigy will automatically set them all to "answer": "accept" β i.e. import them as correct examples.
I have done the same, but the accuracy results are very bad on the data I imported. On the same data, a simple linear SVM gives me around 63% accuracy, but with Prodigy's training the results are not nearly as good.
Note: I am using the German news model, since I am dealing with German data, and the data is very imbalanced.
Loaded model de_dep_news_sm
Using 20% of examples (819) for evaluation
Using 100% of remaining examples (3278) for training
Dropout: 0.2 Batch size: 10 Iterations: 5
Well, according to the results you've posted, Prodigy thinks the accuracy is 1.0, i.e. 100%, which is obviously suspicious. The reason for this is that you're currently only training on "accept" examples, i.e. correct ones. So in this case, your model has simply learned that "everything is correct", which leads to a 100% accuracy, but is obviously pretty useless overall.
The solution is to add a bunch of "wrong" examples; ideally, the same amount as "correct" examples, so you have a nice 50/50 split. You can also do this programmatically by simply swapping out the labels in your existing dataset. (In the long run, you should probably also consider creating wrong examples using different texts.)
Then you add those examples to your Prodigy dataset, and set the answer to "reject":
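For example, here's a rough sketch that pairs each text with a different, incorrect label and marks it as rejected. The file names are placeholders, and it assumes the JSONL format from above with "text" and "label" keys:

```python
import json
import random

# Rough sketch with placeholder file names; assumes the JSONL format from above.
with open("data.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

all_labels = sorted({eg["label"] for eg in examples})

with open("data_reject.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        # pick a random label that differs from the correct one
        wrong_label = random.choice([l for l in all_labels if l != eg["label"]])
        task = {"text": eg["text"], "label": wrong_label, "answer": "reject"}
        f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

You can then import data_reject.jsonl into the same dataset with prodigy db-in, just like before.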
You should now have 40k annotations in your set: 20k correct and 20k incorrect. Now running batch-train should give you a more realistic accuracy score, and hopefully beat your previous 63%.
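For example, something along these lines (the output path is a placeholder, and the exact flags can differ between Prodigy versions, so check prodigy textcat.batch-train --help):

prodigy textcat.batch-train my_dataset de_dep_news_sm --output /path/to/model --eval-split 0.2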