I have a corpus of 15 documents, averaging 21,936 tokens per document. `list(nlp.pipe(texts))` completes in about two minutes.
I'm trying to run my corpus through `ner.teach`. Having the context of the entire document is helpful for the annotator, so I removed the `stream = split_sentences(model.orig_nlp, stream)` line from the teach recipe.
Prodigy can't handle this. I launch the web interface, see "Loading…" for about 15 minutes, and then it crashes and the web interface reads "Error". The debug log looks like this:
15:01:09 - DB: Initialising database SQLite
15:01:09 - DB: Connecting to database SQLite
15:01:10 - RECIPE: Loading recipe from file teach.py
15:01:10 - RECIPE: Calling recipe 'ner.teach'
15:01:10 - RECIPE: Starting recipe ner.teach
15:01:10 - LOADER: Using file extension 'jsonl' to find loader
15:01:10 - LOADER: Loading stream from jsonl
15:01:10 - LOADER: Rehashing stream
15:01:24 - RECIPE: Creating EntityRecognizer using model en
15:01:37 - MODEL: Added sentence boundary detector to model pipeline
15:01:37 - MODEL: Loading match patterns from disk
15:01:37 - MODEL: Adding 4 patterns
15:01:37 - MODEL: Ensure pattern labels are added to EntityRecognizer
15:01:37 - RECIPE: Created PatternMatcher and loded in patterns
15:01:37 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
15:01:37 - CONTROLLER: Initialising from recipe
15:01:37 - DB: Creating dataset 'full-doc-effective-date'
15:01:37 - DB: Loading dataset 'full-doc-effective-date' (0 examples)
15:01:37 - DB: Creating dataset '2018-01-15_15-48-37'
15:01:01 - GET: /project
15:01:02 - GET: /get_questions
15:01:02 - CONTROLLER: Iterating over stream
15:01:02 - FILTER: Filtering duplicates from stream
15:01:02 - MODEL: Predicting spans for batch (batch size 64)
On another run with the same corpus, Prodigy did manage to parse everything and start presenting me with documents to annotate, but only after an hour or so, and every time it recalculated new candidates it took another hour. This is running on an average MacBook Pro, and Activity Monitor shows huge CPU usage.
I figure your focus has been on annotating short texts because that fits with Prodigy’s “make lots of simple decisions quickly” philosophy. I wouldn’t be surprised if my 21,000 token documents were pushing the system beyond its expected limits, but I want to understand exactly what is going on here.
- Are you surprised that Prodigy crashes in this scenario?
- Why would Prodigy crash when analyzing this corpus when `nlp.pipe()` can handle the whole thing just fine?
- Any advice for how to display the entire document as context without necessarily parsing the entire thing? (Maybe the Prodigy UI could show the entire document text but only parse the paragraph containing the candidate annotation. I'm not sure how to do this.)
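To make the last question concrete, here is a rough pure-Python sketch of what I have in mind (no Prodigy internals; the task layout and `meta` fields are my own guesses at how this could work): split each document into paragraphs and emit one task per paragraph, so only the paragraph text would be parsed, while the full document text rides along in `meta` for display.

```python
def paragraph_tasks(doc_text, doc_id):
    """Split a document on blank lines and yield one task dict per
    paragraph. Only the paragraph goes in "text" (the part the model
    would parse); the full document is kept in "meta" purely so a
    hypothetical UI could show it as surrounding context."""
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    offset = 0
    for i, para in enumerate(paragraphs):
        # Track each paragraph's character offset in the original document,
        # so spans annotated on the paragraph could be mapped back later.
        start = doc_text.index(para, offset)
        offset = start + len(para)
        yield {
            "text": para,
            "meta": {
                "doc_id": doc_id,
                "paragraph": i,
                "char_start": start,
                "full_text": doc_text,  # for display only, not parsing
            },
        }
```

For example, `list(paragraph_tasks("First para.\n\nSecond para.", "doc-1"))` yields two tasks, with the second paragraph's `char_start` at 13, so each candidate annotation could be parsed against a ~100-token paragraph instead of a 21,000-token document.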