Just to make sure I understand your question correctly: The 20k examples you have are already labelled, so you want to skip the annotation part?
In this case, you can simply use the prodigy db-in command to import your annotations and add them to a new dataset. All you need to do is convert your data to a format Prodigy can read in – the most convenient would be JSON or JSONL (newline-delimited JSON, which can be read in line by line):
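For example, assuming your labelled examples are (text, label) pairs (the data and file names below are illustrative), converting them to JSONL could look like this:

```python
import json

# Hypothetical labelled data: (text, label) pairs -- replace with your own.
examples = [
    ("Die Regierung kündigte neue Reformen an.", "POLITIK"),
    ("Der FC Bayern gewann das Spiel 3:1.", "SPORT"),
]

# Write one JSON object per line (JSONL). Each record carries the text,
# its label and an "answer" key, which Prodigy uses for annotations.
with open("my_examples.jsonl", "w", encoding="utf8") as f:
    for text, label in examples:
        task = {"text": text, "label": label, "answer": "accept"}
        f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

You could then import the file with a command along the lines of prodigy db-in my_dataset my_examples.jsonl (dataset and file names are placeholders).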
I have done the same, but the accuracy results on the data I imported are very bad. On the same data, a simple linear SVM gets around 63% accuracy, but with Prodigy's training the results are not nearly as good.
Note: I am using the German news model; I am dealing with German data, and it is very imbalanced.
Loaded model de_dep_news_sm
Using 20% of examples (819) for evaluation
Using 100% of remaining examples (3278) for training
Dropout: 0.2 Batch size: 10 Iterations: 5
Well, according to the results you’ve posted, Prodigy thinks the accuracy is 1.0, i.e. 100% – which is obviously suspicious. The reason for this is that you’re currently only training on "accept" examples, i.e. correct ones. So in this case, your model has simply learned that “everything is correct”, which leads to 100% accuracy, but is obviously pretty useless overall.
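To see why training only on accepted examples produces that number, consider a toy evaluation: if every example in the held-out set is labelled "accept", a degenerate model that always predicts "accept" scores perfectly.

```python
# Toy illustration: an evaluation set where every answer is "accept"
# (819 examples, matching the evaluation split in the log above).
eval_answers = ["accept"] * 819

# A degenerate model that has learned "everything is correct".
predictions = ["accept"] * len(eval_answers)

correct = sum(p == a for p, a in zip(predictions, eval_answers))
print(correct / len(eval_answers))  # 1.0 -- perfect, but meaningless
```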
The solution is to add a bunch of “wrong” examples – ideally, the same amount as “correct” examples, so you have a nice 50/50 split. You can also do this programmatically by simply swapping out the labels in your existing dataset. (In the long run, you should probably also consider creating wrong examples using different texts.)
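Programmatically, swapping out the labels could look like the following sketch. The label set and the records here are illustrative – a real script would load your existing dataset (e.g. a JSONL export) instead of a hard-coded list:

```python
import json
import random

LABELS = ["POLITIK", "SPORT", "WIRTSCHAFT"]  # illustrative label set

# Existing "accept" examples, e.g. loaded from your JSONL export.
accepted = [
    {"text": "Die Regierung kündigte neue Reformen an.", "label": "POLITIK", "answer": "accept"},
    {"text": "Der FC Bayern gewann das Spiel 3:1.", "label": "SPORT", "answer": "accept"},
]

rejected = []
for eg in accepted:
    # Pick a different label, so the (text, label) pair becomes wrong.
    wrong_label = random.choice([l for l in LABELS if l != eg["label"]])
    rejected.append({"text": eg["text"], "label": wrong_label, "answer": "reject"})

# Write the "wrong" examples out as JSONL for import.
with open("reject_examples.jsonl", "w", encoding="utf8") as f:
    for eg in rejected:
        f.write(json.dumps(eg, ensure_ascii=False) + "\n")
```

This gives you one reject example per accept example, i.e. the 50/50 split described above.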
Then you add those examples to your Prodigy dataset, and set the answer to "reject":
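Assuming the swapped examples were saved to a JSONL file in which each record already has "answer": "reject", the import could look like this (the dataset and file names are placeholders):

```shell
# Import the reject examples into the same Prodigy dataset.
# my_textcat_data and reject_examples.jsonl are placeholder names.
prodigy db-in my_textcat_data reject_examples.jsonl
```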