Prodigy is slow at loading annotations

We are using Prodigy's ner.correct recipe with en_core_web_md to annotate people and organisations in our domain's documents, but it currently takes ~40 minutes for Prodigy to start serving documents to annotate.

I read ner.teach very slow - #6 by ines and some similar responses, and I understand that the problem might arise from trying to find segments with entities. But given that we use a JSONL file that we have already filtered so it only keeps documents containing entities, 40 minutes sounds like a lot.

It is worth saying that the documents are quite long, but since we know they contain the relevant entities, should it take that much time? Running nlp(text) on each document is slow as well, but not that slow: it takes approximately 2s per document (we assume Prodigy does not use nlp.pipe).

I enabled the logs to get more information, and this is the segment where it gets stuck:

15:09:28: FEED: Finding next batch of questions in stream
15:42:52: RESPONSE: /get_session_questions (10 examples)

with no information in between.

Is there a way we can speed up the loading? Are we doing anything wrong?

P.S. We are using the Prodigy nightly and spaCy 3. Let me know if you need more information about our environment, data or anything else.

40 minutes is definitely way too long; that indicates there must be something else going on besides the model just being slow at processing.

How long are your documents and how many examples/sentences are in your JSONL? Do you have many examples in the JSONL that are already in the dataset? From looking at the logs, it seems like Prodigy spends the 40 minutes going through your source data trying to find the next batch to send out. If you're working with one huge JSONL file and a large number of examples is already annotated, this could potentially lead to startup taking longer over time, because each example is processed and then skipped. In that case, you could split your file up into smaller portions, so if you've already gone through 1000 examples, you can start at example 1001 instead of at the beginning.
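The splitting could be done with a short stdlib-only script along these lines; the function name, chunk size and file naming scheme are just placeholders:

```python
from pathlib import Path

def split_jsonl(src_path, out_dir, chunk_size=500):
    """Split one large JSONL file into numbered chunks of up to
    chunk_size examples each, and return the number of chunks written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        line for line in Path(src_path).read_text(encoding="utf8").splitlines()
        if line.strip()  # drop blank lines
    ]
    n_chunks = 0
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        out_file = out_dir / f"part_{n_chunks:03}.jsonl"
        out_file.write_text("\n".join(chunk) + "\n", encoding="utf8")
        n_chunks += 1
    return n_chunks
```

You'd then point each annotation session at the next chunk instead of the one big file, so a restarted server only has to scan that chunk.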

Prodigy pretty much always uses nlp.pipe for efficiency, including in ner.correct. It processes the examples in batches of 10, which should be pretty fast.

Hi Ines,

Thanks for the quick response. We have 1,736 docs in the JSONL, with a mean sentence length of 508 and a standard deviation of 2,134, so it varies a lot.

We have annotated 6,000 segments from 12 docs, which does not explain the wait: a naive script that passes each doc through nlp and checks whether it is already annotated takes only 12 seconds to find 10 unannotated documents.

Is there a way to tell Prodigy to skip some examples, or can we only do that by passing in documents that are not yet annotated? And is it a best practice for the future to split the docs into smaller chunks when annotations start to pile up?

By the way, we noticed that only one core is used while Prodigy is running; is that normal, given that Prodigy uses nlp.pipe? Also, in case it makes any difference, we are running Prodigy and the database inside a Docker container.

One option could be to split your JSONL into smaller files, yes. Once you're done with file 1, you can start at file 2, and if you ever restart the server, it'll only have to go through the examples in that file again. The hackier version of this would be to just remove all lines from the top of your JSONL that you know are already annotated in the data.

Also, if you want to really optimize for processing performance, another option could be to do the pre-processing separately, e.g. have a script that runs your spaCy model over your input texts and adds the "tokens" and "spans". You can then run that on a remote machine, potentially even with a GPU, parallelize it etc. and output a static JSONL file that already has everything you need. You can then use that with ner.manual or a custom recipe that only streams in exactly what's in the data and doesn't do any pre-processing.
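A rough sketch of such a pre-processing script is below. The output keys follow the JSONL task format ("text", "tokens", "spans") described in the Prodigy docs, but the label set, file names and batch settings are assumptions you'd adapt to your setup:

```python
import json

def doc_to_task(doc, labels=("PERSON", "ORG")):
    """Convert a spaCy Doc into a Prodigy-style task dict with
    pre-computed "tokens" and "spans" for the given entity labels."""
    tokens = [
        {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
        for t in doc
    ]
    spans = [
        {
            "start": ent.start_char,
            "end": ent.end_char,
            "token_start": ent.start,
            "token_end": ent.end - 1,  # Prodigy's token_end is inclusive
            "label": ent.label_,
        }
        for ent in doc.ents
        if ent.label_ in labels
    ]
    return {"text": doc.text, "tokens": tokens, "spans": spans}

if __name__ == "__main__":
    import spacy

    nlp = spacy.load("en_core_web_md")
    with open("input.jsonl", encoding="utf8") as f:
        texts = [json.loads(line)["text"] for line in f if line.strip()]
    with open("preannotated.jsonl", "w", encoding="utf8") as f:
        # nlp.pipe batches the docs; n_process parallelises across cores
        for doc in nlp.pipe(texts, batch_size=32, n_process=2):
            f.write(json.dumps(doc_to_task(doc)) + "\n")
```

The resulting preannotated.jsonl can then be streamed into ner.manual, which does no model processing at startup and just shows the pre-set spans for correction.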

Prodigy doesn't set n_process on nlp.pipe by default, but you could try adding that in the recipe to see if it makes a difference. (You can run prodigy stats to find the location of your local installation and then just hack it into the ner.correct function.) But I think the absolute fastest solution would be the pre-processing approach I described above.

Thanks for the suggestions, super helpful :blush: