We are using Prodigy's ner.teach with en_core_web_md to annotate people and organisations in the domain of our documents, but it currently takes ~40 minutes for Prodigy to start serving documents to annotate.
I read "ner.teach very slow - #6 by ines" and some similar responses, and I understand that the problem might arise from trying to find segments with entities. But given that we use a JSONL source that we have already filtered so it only keeps documents containing entities, 40 minutes sounds like a lot.
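For context, our pre-filtering step looks roughly like this (a sketch; the `PATTERNS` list and the dummy data are placeholders standing in for our real entity matchers):

```python
import json
import re

# Placeholder patterns standing in for our real entity matchers
PATTERNS = [re.compile(r"\bAcme Corp\b"), re.compile(r"\bJane Doe\b")]

def contains_entity(text):
    return any(p.search(text) for p in PATTERNS)

def filter_jsonl_lines(lines):
    """Keep only JSONL examples whose text mentions at least one entity."""
    kept = []
    for line in lines:
        example = json.loads(line)
        if contains_entity(example["text"]):
            kept.append(example)
    return kept

# Example with dummy data
sample = [
    json.dumps({"text": "Acme Corp announced quarterly earnings."}),
    json.dumps({"text": "Nothing relevant in this one."}),
]
filtered = filter_jsonl_lines(sample)
print(len(filtered))  # prints 1
```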
It is worth noting that the documents are quite long, but since we know they contain the relevant entities, should it take that much time? Running nlp(text) on each document is slow as well, but not that slow: it takes approx. 2s per document (we assume Prodigy does not use nlp.pipe).
I enabled the logs to get more information, and these are the lines where it gets stuck:

15:09:28: FEED: Finding next batch of questions in stream
15:42:52: RESPONSE: /get_session_questions (10 examples)

with no information in between.
Is there a way we can speed up the loading? Are we doing anything wrong?
P.S. We are using the Prodigy nightly and spaCy 3. Let me know if you need more information on our environment, data, or anything else.