Is there a faster way to add records to a prodigy db than "add_examples"?

(Alexis Raykhel) #1

I’m trying to add records to my prodigy db using add_examples. It takes like 20 mins to add 300 “records” (already in the prodigy format, which is why I’m using add_examples), and I have one list of records that is 30k long, soooo that’s going to take a while. I’m at a loss how to add it faster; does someone with more knowledge about the prodigy database structure have an idea?


(Andy Halterman) #2

I was just doing this this morning. If you have your examples in a Prodigy-formatted JSONL file, you can use the db-in command to add them to a dataset. You’ll need to make sure the dataset already exists in Prodigy (e.g. create it first with the prodigy dataset command), and then you can run

prodigy db-in [dataset] [in_file]

There’s more documentation in the README.
I added 3,000 annotations in about 2 seconds, so it should work just fine with your 30k. Weird that add_examples is so slow; I’m not sure why that’s the case.
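
If your records are still sitting in a Python list rather than a file, something like the sketch below could write them out as JSONL first. It only uses the standard library; the helper name write_jsonl, the file name examples.jsonl, and the two sample records are made up for illustration, so swap in your own.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL layout)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical records; real ones carry whatever fields your task uses.
records = [
    {"text": "First example", "answer": "accept"},
    {"text": "Second example", "answer": "reject"},
]

out_path = write_jsonl(records, "examples.jsonl")
```

The resulting file is what you would pass as [in_file] to the db-in command above.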

(Alexis Raykhel) #3

Thanks, Andy! I’ll use that. I know part of my problem was that I wasn’t specifying a list of datasets, just a string name ("dataset" instead of ["dataset"]), which gets interpreted as each letter of the string being the name of a dataset. :eyeroll: But even once I fixed that, it was faster but still not as fast as what you’re describing! So yeah, I’ll try that.

(Ines Montani) #4

Ah, damn – but good thing you caught it! Internally, the method iterates over the value of datasets and only fails if it’s not iterable (but a string obviously is, too). We can add a check for this in Prodigy though so it raises an error if the value isn’t a list or tuple!
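
To illustrate, here is a rough sketch of why a bare string slips through and what such a check could look like. This is not Prodigy's actual code: check_datasets is a hypothetical helper that you could call in your own script before passing the value on to add_examples.

```python
def check_datasets(datasets):
    """Return the dataset names as a list, rejecting anything
    that is not a list or tuple (including a bare string)."""
    if not isinstance(datasets, (list, tuple)):
        raise TypeError(
            f"datasets must be a list or tuple of names, got {type(datasets).__name__}"
        )
    return list(datasets)


# A string is iterable, so looping over it yields one "name" per character.
print(list("dataset"))  # ['d', 'a', 't', 'a', 's', 'e', 't']

# Wrapping the name in a list gives the intended single dataset.
print(check_datasets(["dataset"]))  # ['dataset']
```

With a guard like this, passing "dataset" raises a TypeError right away instead of quietly creating one dataset per letter.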

(Alexis Raykhel) #5

Yeah, that’d be a nice check! I figured it out finally when I was double checking which datasets were in my database and saw “d”, “a”, “t”, “a”… :sweat: