When trying to use a dataset with >50k examples, Prodigy is unable to start up?

I have Prodigy set up on a server for people to use for annotation. Once my dataset reaches around 50k examples, the app seems unable to start up (or it actually goes down). Just curious if anyone else has come across something like this, and if there's a solution besides creating a new dataset for new annotations to be saved to.

I'm using Prodigy 1.4.2, btw. The solution might be "use a newer version of Prodigy", but upgrading is still a work in progress on our end.

Hi! Which recipe are you using? And by "dataset", do you mean the Prodigy dataset with the annotated examples, or the data you're loading in?

I'm using a custom-written recipe. And yes, I mean the Prodigy dataset with the annotated examples. It's totally possible it's something in our recipe, though I also wonder if the slowdown and the app going down are caused by the 'exclude' keyword, where we exclude things already in the dataset.
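For context, we're not doing anything fancy there: it's just the standard "exclude" key in the recipe's return dict, roughly like this (recipe name, loader and view_id simplified):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("my-recipe")
def my_recipe(dataset, source):
    stream = JSONL(source)
    return {
        "dataset": dataset,            # dataset the annotations are saved to
        "stream": stream,
        "view_id": "classification",
        # skip examples that are already present in these datasets
        "exclude": [dataset],
    }
```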

[eta: accidentally posted before I was done!]

Thanks for the update. Are you using the default SQLite database? And are you implementing any custom exclude logic in your recipe?

With 50k+ examples, calls to db.get_dataset are more expensive, so if you do this a lot in your recipe (e.g. in the stream), it's possible you're running out of memory because you keep loading a list of 50k dicts. If you're dealing with a large number of examples, you definitely want to make sure that you're only working with the hashes and ideally only loading them once.
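For example, instead of calling db.get_dataset somewhere in the stream, you could load the existing hashes once when the recipe starts and filter against that set. A rough sketch (the dataset name is a placeholder, and this assumes you're using the database helpers get_task_hashes and set_hashes):

```python
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # uses the database settings from your prodigy.json

# load only the hashes, and only once, instead of 50k full dicts on every call
existing_hashes = set(db.get_task_hashes("my_dataset"))  # placeholder dataset name

def filter_stream(stream):
    for eg in stream:
        eg = set_hashes(eg)  # make sure _input_hash and _task_hash are set
        if eg["_task_hash"] not in existing_hashes:
            yield eg
```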

Another thing you could try is setting PRODIGY_LOGGING=basic to log what's going on behind the scenes. You can also use Prodigy's log helper to add your own messages to the log from within your recipe. This should give you a rough idea of where it fails and the last thing it does before the server dies.
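For example, something along these lines inside the recipe (the message text is just an illustration), combined with starting the server with PRODIGY_LOGGING=basic:

```python
from prodigy.util import log

def debug_stream(stream):
    for eg in stream:
        # messages show up in the PRODIGY_LOGGING=basic output, so the last
        # entry before the crash tells you how far the recipe got
        log("RECIPE: Yielding example from stream", eg)
        yield eg
```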

I'm using a PostgreSQL database and there's no custom exclude logic. I'll try what you suggested, thanks!