When trying to use a dataset with >50k examples, Prodigy is unable to start up?

I have Prodigy set up on a server for people to use for annotation. Once my dataset reaches around 50k examples, the app seems unable to start up (or it actually goes down). Just curious if anyone else has come across something like this, and if there's a solution besides creating a new dataset for new annotations to be saved to.

I'm using Prodigy 1.4.2, btw. The solution might be "use a newer version of Prodigy", but upgrading is still a work in progress on our end.

Hi! Which recipe are you using? And by "dataset", do you mean the Prodigy dataset with the annotated examples, or the data you're loading in?

I'm using a custom-written recipe. And yes, I mean the Prodigy dataset with the annotated examples. It's totally possible it's something in our recipe, though I also wonder if the slowdown and the app going down are caused by the 'exclude' keyword, where we exclude things already in the dataset.
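For context, we're not doing anything fancy there: it's just the standard "exclude" key in the recipe's return dict, roughly like this (recipe name, loader and view_id simplified):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("my-recipe")
def my_recipe(dataset, source):
    stream = JSONL(source)
    return {
        "dataset": dataset,            # dataset the annotations are saved to
        "stream": stream,
        "view_id": "classification",
        # skip examples that are already present in these datasets
        "exclude": [dataset],
    }
```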

[eta: accidentally posted before I was done!]

Thanks for the update. Are you using the default SQLite database? And are you implementing any custom exclude logic in your recipe?

With 50k+ examples, calls to db.get_dataset are more expensive, so if you do this a lot in your recipe (e.g. in the stream), it's possible you're running out of memory because you keep loading a list of 50k dicts. If you're dealing with a large number of examples, you definitely want to make sure that you're only working with the hashes and ideally only loading them once.
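For example, instead of calling db.get_dataset somewhere in the stream, you could load the existing hashes once when the recipe starts and filter against that set. A rough sketch (the dataset name is a placeholder, and this assumes you're using the database helpers get_task_hashes and set_hashes):

```python
from prodigy.components.db import connect
from prodigy import set_hashes

db = connect()  # uses the database settings from your prodigy.json

# load only the hashes, and only once, instead of 50k full dicts on every call
existing_hashes = set(db.get_task_hashes("my_dataset"))  # placeholder dataset name

def filter_stream(stream):
    for eg in stream:
        eg = set_hashes(eg)  # make sure _input_hash and _task_hash are set
        if eg["_task_hash"] not in existing_hashes:
            yield eg
```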

Another thing you could try is setting PRODIGY_LOGGING=basic to log what's going on behind the scenes. You can also use Prodigy's log helper to add your own messages to the log from within your recipe. This should give you a rough idea of where it fails and the last thing it does before the server dies.
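For example, something along these lines inside the recipe (the message text is just an illustration), combined with starting the server with PRODIGY_LOGGING=basic:

```python
from prodigy.util import log

def debug_stream(stream):
    for eg in stream:
        # messages show up in the PRODIGY_LOGGING=basic output, so the last
        # entry before the crash tells you how far the recipe got
        log("RECIPE: Yielding example from stream", eg)
        yield eg
```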

I'm using a PostgreSQL database and there's no custom exclude logic. I'll try what you suggested, thanks!