Ohhh, okay – thanks for the clarification. I think I know what the problem is.
What format is your dataset in, and are all the sentences in one string? When you load data from a text file, Prodigy will try to stream it in line by line and split each text into individual sentences. This is usually quite fast, because everything can be processed as a stream – but if the first (and only) line contains 800k sentences, Prodigy has to read the whole thing in, process it all with spaCy and split it into sentences before you can get started. That needs a lot of memory, which your machine likely doesn’t have.
So if that’s what’s going on, try providing your texts in a format that can be read in line by line, e.g. .txt or JSONL (newline-delimited JSON) with one sentence or paragraph per line.
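If it helps, here’s a minimal sketch of such a conversion. The file names are placeholders, and a naive regex split stands in for proper sentence segmentation (spaCy’s sentencizer would be more accurate) – but the output is the JSONL format Prodigy expects, with one `{"text": ...}` object per line:

```python
import json
import re

def convert_to_jsonl(in_path, out_path):
    """Split one big text into sentences and write newline-delimited JSON.

    Naive split on ., ! or ? followed by whitespace – spaCy's sentencizer
    would handle abbreviations etc. better, but this keeps the example
    dependency-free.
    """
    with open(in_path, encoding="utf8") as f:
        text = f.read()
    sentences = re.split(r"(?<=[.!?])\s+", text)
    with open(out_path, "w", encoding="utf8") as f:
        for sent in sentences:
            sent = sent.strip()
            if sent:
                # one JSON object per line, so Prodigy can stream it
                f.write(json.dumps({"text": sent}) + "\n")

# convert_to_jsonl("data.txt", "data.jsonl")  # paths are placeholders
```

Once the file is in that shape, Prodigy can read it one line at a time instead of holding everything in memory at once.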