One option could be to split your JSONL into smaller files, yes. Once you're done with file 1, you can start at file 2, and if you ever restart the server, it'll only have to go through the examples in that file again. The hackier version of this would be to just remove all lines from the top of your JSONL that you know are already annotated in the data.
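If you go the hacky route, you don't have to trim the file by hand: Prodigy stores an `_input_hash` for every annotated example, so a small script can drop the already-annotated lines for you. Here's a rough sketch, assuming your dataset is called `your_dataset` and the hashing settings haven't changed between runs (both are placeholders to adapt):

```python
import srsly
from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
annotated = set(db.get_input_hashes("your_dataset"))

def unseen(stream):
    for eg in stream:
        # Recompute the same input hash Prodigy would assign on load
        eg = set_hashes(eg)
        if eg["_input_hash"] not in annotated:
            yield eg

stream = srsly.read_jsonl("input.jsonl")
srsly.write_jsonl("remaining.jsonl", unseen(stream))
```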
Also, if you want to really optimize for processing performance, another option could be to do the pre-processing separately, e.g. have a script that runs your spaCy model over your input texts and adds the `"tokens"` and `"spans"` properties. You can then run that on a remote machine, potentially even with a GPU, parallelize it etc., and output a static JSONL file that already has everything you need. You can then use that with `ner.manual` or a custom recipe that only streams in exactly what's in the data and doesn't do any pre-processing.
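As a minimal sketch of what that pre-processing script could look like (the model name, file paths and `n_process` value are placeholders; with a GPU you'd drop `n_process` and tune `batch_size` instead):

```python
import spacy
import srsly

# Placeholder model name -- swap in the pipeline you actually use
nlp = spacy.load("en_core_web_lg")

def add_tokens_and_spans(stream):
    texts = ((eg["text"], eg) for eg in stream)
    # n_process parallelizes the pipeline across CPU cores
    for doc, eg in nlp.pipe(texts, as_tuples=True, n_process=4):
        eg["tokens"] = [
            {
                "text": token.text,
                "start": token.idx,
                "end": token.idx + len(token.text),
                "id": token.i,
                "ws": bool(token.whitespace_),
            }
            for token in doc
        ]
        eg["spans"] = [
            {
                "start": ent.start_char,
                "end": ent.end_char,
                "label": ent.label_,
                # token_end is inclusive in Prodigy's span format
                "token_start": ent.start,
                "token_end": ent.end - 1,
            }
            for ent in doc.ents
        ]
        yield eg

stream = srsly.read_jsonl("input.jsonl")
srsly.write_jsonl("preprocessed.jsonl", add_tokens_and_spans(stream))
```

You can then serve the result with something like `prodigy ner.manual your_dataset blank:en preprocessed.jsonl --label PERSON,ORG`, since `ner.manual` will respect the pre-set `"tokens"` and `"spans"` instead of recomputing them.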
Prodigy doesn't set `n_process` on `nlp.pipe` by default, but you could try adding that in the recipe to see if it makes a difference. (You can run `prodigy stats` to find the location of your local installation and then just hack it into the `ner.correct` function.) But I think the absolute fastest solution would be the pre-processing approach I described above.
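If you'd rather not edit the installed source, a custom recipe gives you the same effect in a safer place. To be clear, this is only a rough approximation of what `ner.correct` does, not its actual code; the recipe name, the `n_process` value and the label filtering are all assumptions to adapt:

```python
import copy

import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    "ner.correct.parallel",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline", "positional", None, str),
    source=("Path to input JSONL", "positional", None, str),
    label=("Comma-separated labels to accept", "option", "l", str),
)
def ner_correct_parallel(dataset, spacy_model, source, label=""):
    labels = label.split(",") if label else []
    nlp = spacy.load(spacy_model)

    def make_tasks(stream):
        texts = ((eg["text"], eg) for eg in stream)
        # The experimental part: n_process runs the pipeline in multiple
        # worker processes instead of a single one
        for doc, eg in nlp.pipe(texts, as_tuples=True, n_process=2):
            task = copy.deepcopy(eg)
            task["spans"] = [
                {
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                    "token_start": ent.start,
                    "token_end": ent.end - 1,
                }
                for ent in doc.ents
                if not labels or ent.label_ in labels
            ]
            yield task

    stream = add_tokens(nlp, make_tasks(JSONL(source)))

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": labels},
    }
```

You'd run it with something like `prodigy ner.correct.parallel your_dataset en_core_web_lg input.jsonl --label PERSON,ORG -F recipe.py`.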