Prodigy crashes on large documents

I have a corpus of 15 documents, averaging 21,936 tokens per document. list(nlp.pipe(texts)) completes in about two minutes. I’m trying to run my corpus through ner.teach. Having the context of the entire document is helpful for the annotator, so I removed the

stream = split_sentences(model.orig_nlp, stream)

line from the teach recipe.

Prodigy can’t handle this. I launch the web interface and see “Loading…” for about 15 minutes and then it crashes and the web interface reads “Error”. The debug log looks like this:

15:01:09 - DB: Initialising database SQLite
15:01:09 - DB: Connecting to database SQLite
15:01:10 - RECIPE: Loading recipe from file
15:01:10 - RECIPE: Calling recipe 'ner.teach'
15:01:10 - RECIPE: Starting recipe ner.teach
15:01:10 - LOADER: Using file extension 'jsonl' to find loader
15:01:10 - LOADER: Loading stream from jsonl
15:01:10 - LOADER: Rehashing stream
15:01:24 - RECIPE: Creating EntityRecognizer using model en
15:01:37 - MODEL: Added sentence boundary detector to model pipeline
15:01:37 - MODEL: Loading match patterns from disk
15:01:37 - MODEL: Adding 4 patterns
15:01:37 - MODEL: Ensure pattern labels are added to EntityRecognizer
15:01:37 - RECIPE: Created PatternMatcher and loded in patterns
15:01:37 - SORTER: Resort stream to prefer uncertain scores (bias 0.0)
15:01:37 - CONTROLLER: Initialising from recipe
15:01:37 - DB: Creating dataset 'full-doc-effective-date'
15:01:37 - DB: Loading dataset 'full-doc-effective-date' (0 examples)
15:01:37 - DB: Creating dataset '2018-01-15_15-48-37'
15:01:01 - GET: /project
15:01:02 - GET: /get_questions
15:01:02 - CONTROLLER: Iterating over stream
15:01:02 - FILTER: Filtering duplicates from stream
15:01:02 - MODEL: Predicting spans for batch (batch size 64)

Another time I tried the same corpus, Prodigy did manage to parse everything and start presenting me with documents to annotate, but only after an hour or so. And every time it recalculated new candidates it would take another hour. This is running on an average MacBook Pro. Activity Monitor shows huge CPU usage.

I figure your focus has been on annotating short texts because that fits with Prodigy’s “make lots of simple decisions quickly” philosophy. I wouldn’t be surprised if my 21,000 token documents were pushing the system beyond its expected limits, but I want to understand exactly what is going on here.

  1. Are you surprised that Prodigy crashes in this scenario?
  2. Why would Prodigy crash when analyzing this corpus when nlp.pipe() can handle the whole thing just fine?
  3. Any advice for how to display the entire document as context without necessarily parsing the entire thing? (Like maybe the Prodigy UI shows the entire document text but only parses the paragraph containing the candidate annotation. I’m not sure how to do this.)

I think the problem isn’t that all documents are processed at once – internally, Prodigy works with generators to process the stream, so examples are processed as they come in.
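To illustrate what "processed as they come in" means in general, here is a toy sketch of a generator-based stream. It is not Prodigy's actual internals; load_stream and add_length are made-up names, and the task dicts are simplified.

```python
def load_stream(texts):
    # Hypothetical loader: yields one task dict at a time instead of a full list.
    for text in texts:
        yield {"text": text}


def add_length(stream, log):
    # Hypothetical processing step. `log` records which texts have
    # actually been touched, so the laziness is visible from outside.
    for task in stream:
        log.append(task["text"])
        task["n_chars"] = len(task["text"])
        yield task


processed = []
stream = add_length(load_stream(["first doc", "second doc", "third doc"]), processed)

# Nothing has been processed yet; pulling one item processes exactly one document.
first = next(stream)
print(processed)
```

Note that a consumer that asks for a whole batch at once will still pull that many documents through the pipeline before anything is shown. With the batch size of 64 in your log and only 15 documents, one batch covers the entire corpus.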

It seems like the difficulty here is that ner.teach uses beam search, so it’s trying to find the best parses for each 20k token document, rather than only the single best parse like nlp.pipe. At each step, it has a beam of states, and has to create a new state for the next token. That state representation involves a copy whose cost is sensitive to the document length. On normal-length documents, this is such a small overhead that in practice, the time is still linear. But with 20k token documents, the running time becomes non-linear, because the copy starts to dominate.
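As a rough back-of-the-envelope model of why the copy matters (a toy cost count, not spaCy's real implementation; the function name, the beam width of 4 and the "one cell per token" state size are all assumptions for illustration):

```python
def copied_cells(n_tokens, beam_width=4, copy_state=True):
    """Count the state cells touched by a toy beam search.

    Assumption: when copy_state is True, creating a new state for each
    token copies a representation whose size is proportional to the
    document length. When False, each step costs a constant 1.
    """
    total = 0
    for _ in range(n_tokens):
        per_state = n_tokens if copy_state else 1
        total += beam_width * per_state
    return total


short_doc = copied_cells(30)        # roughly a sentence or two
long_doc = copied_cells(20_000)     # roughly one of the full documents

print(20_000 / 30)                  # how much longer the document is
print(long_doc / short_doc)         # how much more copy work it needs
```

Under these assumptions the work grows with the square of the document length, so a document that is a few hundred times longer needs several hundred thousand times more copying. Without the copy (copy_state=False) the same count stays linear in the number of tokens.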

In terms of accuracy, beam search also won’t perform as well on long documents, because the number of candidates per word that you’re considering (relative to the whole document) is very small. This is also one of the reasons Prodigy tries to keep the task text small (in addition to the advantages for the human annotator, like keeping focused and moving fast).

When you mentioned that the full context is important, I didn’t expect it to be that much context. I know that legal texts are pretty tricky in that respect – but the problem is, if it really is true that the annotator can’t make the decision from one or two sentences, the model is also much less likely to learn anything meaningful from the annotations.

Some ideas for solutions:

  • For your use case, it might be better to start off collecting annotations with ner.match, which only uses the pattern matcher, and will be much faster. If there’s a match, you could also truncate the text around the match, to at least exclude some parts of the full document.
  • Once you’ve collected a bunch of annotations from the patterns, you can pre-train a model, parse the text with spaCy, extract the predictions and annotate them statically using mark.
  • Maybe you can think of a creative way to pre-process your documents to shorten them, or remove text that you can definitely exclude?
  • This could even mean training a model to help you with pre-processing or shortening the documents, for example a per-sentence text classifier. This is actually not a weird workflow at all, and chaining models like this is something we often recommend for more complex use cases.
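For the idea of truncating the text around a match, a minimal sketch could look like the following. It assumes tasks are dicts with a "text" and character-offset "spans" ("start", "end", "label"); truncate_around_span, the window size and the EFFECTIVE_DATE label are hypothetical and only for illustration.

```python
def truncate_around_span(task, span, window=200):
    """Cut the task text to `window` characters on each side of `span`.

    Returns a new task dict. The span offsets are shifted so they still
    point at the same characters inside the shortened text.
    """
    text = task["text"]
    left = max(0, span["start"] - window)
    right = min(len(text), span["end"] + window)
    new_span = dict(span, start=span["start"] - left, end=span["end"] - left)
    # Keep the original offset so the match can be mapped back to the full document.
    return {"text": text[left:right], "spans": [new_span], "meta": {"offset": left}}


full_text = "x" * 500 + "Effective Date" + "y" * 500
match = {"start": 500, "end": 514, "label": "EFFECTIVE_DATE"}

short_task = truncate_around_span({"text": full_text}, match, window=50)
shifted = short_task["spans"][0]
print(len(short_task["text"]), short_task["text"][shifted["start"]:shifted["end"]])
```

This cuts on raw character positions, so the window can start or end in the middle of a word. Snapping the boundaries to paragraph or sentence breaks would probably read better for the annotator.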