I’m trying to add records to my Prodigy db using add_examples. It takes like 20 mins to add 300 “records” (already in the Prodigy format, which is why I’m using add_examples), and I have one list of records that is 30k long, soooo that’s going to take a while. I’m at a loss for how to add them faster; does someone with more knowledge about the Prodigy database structure have an idea?
I was just doing this this morning. If you have your examples in a Prodigy-formatted JSONL, you can use the db-in command to add them to a dataset. You’ll need to make sure the dataset already exists in Prodigy (prodigy dataset etc), and then you can run:
prodigy db-in [dataset] [in_file]
There’s more documentation in the README.
I added 3,000 annotations in about 2 seconds, so it should work just fine with your 30k. Weird that add_examples is so slow. I’m not sure why that’s the case.
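In case the records only exist as a Python list right now, here is a minimal sketch for dumping them to a JSONL file first, so there is something to hand to db-in. It only uses the standard library; write_jsonl, the temp-dir path and the two sample records are made up for illustration, and it assumes the file should hold one JSON object per line.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not part of Prodigy)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Made-up sample records; real ones would carry your own fields.
records = [
    {"text": "First example", "answer": "accept"},
    {"text": "Second example", "answer": "reject"},
]

# Written to the temp directory here; any path you can pass to db-in works.
out_path = write_jsonl(records, Path(tempfile.gettempdir()) / "examples.jsonl")
print(out_path)
```

The resulting file is what would go in the [in_file] slot of the command above.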
Thanks, Andy! I’ll use that. I know part of my problem was that I wasn’t specifying a list of datasets, just a string name (“dataset” instead of [“dataset”]), which gets interpreted as each letter of the string being the name of a dataset. (eyeroll) But even once I fixed that, it was faster but not as fast as you’re saying! So yeah, I’ll try that.
Ah, damn, but good thing you caught it! Internally, the method iterates over the value of datasets and only fails if it's not iterable (but a string obviously is, too). We can add a check for this in Prodigy though so it raises an error if the value isn't a list or tuple!
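To make the string pitfall concrete, here is a small plain-Python sketch of what happens when a bare string is iterated, plus the kind of check described above. check_datasets is a hypothetical helper written for this post, not Prodigy's actual code.

```python
def check_datasets(datasets):
    """Hypothetical guard: accept only a list or tuple of dataset names."""
    if not isinstance(datasets, (list, tuple)):
        raise TypeError(
            f"datasets must be a list or tuple of names, got {type(datasets).__name__}"
        )
    return list(datasets)


# Iterating over a bare string yields one character per "dataset".
print(list("dataset"))              # ['d', 'a', 't', 'a', 's', 'e', 't']

# A one-element list yields the single intended name.
print(check_datasets(["dataset"]))  # ['dataset']

# The bare string is rejected instead of being silently split up.
try:
    check_datasets("dataset")
except TypeError as err:
    print(err)
```

Until a check like this exists in the library, wrapping a single name in a list at the call site avoids the problem.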
Yeah, that’d be a nice check! I finally figured it out when I was double-checking which datasets were in my database and saw “d”, “a”, “t”, “a”…